Python pathlib: Resolve full path to symbolic link without following it
pathlib.Path has an absolute() method that does what you want.
$ mkdir folder
$ touch target
$ ln -s ~/target ~/folder/link
$ ls -l folder/
total 0
lrwxrwxrwx 1 me users 16 Feb 20 19:47 link -> /home/me/target
$ cd folder
$/folder python3.7 -c 'import os.path;print(os.path.abspath("link"))'
/home/me/folder/link
$/folder python3.7 -c 'import pathlib;p = pathlib.Path("link");print(p.absolute())'
/home/me/folder/link
At the time of writing (Python 3.7), the method doesn't appear in the module documentation, but its docstring reads:
Return an absolute version of this path. This function works
even if the path doesn't point to anything.
No normalization is done, i.e. all '.' and '..' will be kept along.
Use resolve() to get the canonical path to a file.
It's worth noting that there are comments in the method code (in the 3.7 branch) that suggest it may not be fully tested on all platforms.
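As a rough illustration of the difference between the two methods, the sketch below rebuilds the same layout in a temporary directory (the directory and file names are only placeholders) and compares absolute() with resolve(). It assumes a platform where the current user is allowed to create symlinks.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway directory containing a file and a symlink to it.
# realpath() is applied to the temp dir in case it sits behind a symlink itself.
base = Path(os.path.realpath(tempfile.mkdtemp()))
target = base / "target"
target.touch()
folder = base / "folder"
folder.mkdir()
(folder / "link").symlink_to(target)

os.chdir(folder)
p = Path("link")

# absolute() only prepends the current directory; the symlink is not followed.
print(p.absolute())   # <base>/folder/link
# resolve() follows the symlink and returns the canonical path of its target.
print(p.resolve())    # <base>/target
```

If only the location of the link itself matters, absolute() (or os.path.abspath) is enough; resolve() is the call to avoid in that case.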
Determine a file's path(s) relative to a directory, including symlinks
This, like many things, is more complex than it might appear on the surface.
Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see: files, directories, sockets, block devices, character devices, etc.
The content of a single "file" can be accessed via one or more paths, and each of these paths is called a "hard link". Hard links can only point at files on the same filesystem; they cannot cross a filesystem boundary.
It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.
It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
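A small sketch of these relationships, using throwaway file names in a temporary directory (assumed to be on a filesystem that supports both hard links and symlinks): a hard link shares the inode of the original file, while a symlink is a separate entity with its own inode.

```python
import os
import tempfile

base = tempfile.mkdtemp()
original = os.path.join(base, "original")
hard = os.path.join(base, "hard")
soft = os.path.join(base, "soft")

with open(original, "w") as fh:
    fh.write("data")

os.link(original, hard)      # a second name for the same inode
os.symlink(original, soft)   # a separate entity that stores a path

# stat() follows symlinks, lstat() describes the link itself.
same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
link_count = os.stat(original).st_nlink
symlink_has_own_inode = os.lstat(soft).st_ino != os.stat(soft).st_ino

print(same_inode, link_count, symlink_has_own_inode)
```

Note that st_nlink tells you how many hard links exist, but not where they are, which is why the tree still has to be scanned.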
Before we get into this... some comments:
- See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so using a lower spec system will take longer...
- Given that this is impossible without calling stat() on every file at some point, you're going to struggle to come up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces).
As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...
To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.
The following function will produce this structure for us:
import errno
import os

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - e.g. dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entity_map[2049][4756]['links'][0]    path to a hard link for inode 4756
    #      entity_map[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
    entity_map = {}
    for root, dirs, files in os.walk(scan_root):
        # paths are recorded relative to scan_root, so this only works
        # when the current working directory is scan_root
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                # stat() follows symlinks, so this describes the target
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == errno.ENOENT:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == errno.ELOOP:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise
            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino
            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]
            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]
            # islink() uses lstat(), i.e. it inspects the path itself
            if os.path.islink(path):
                e_ino['symlinks'].append(path)
            else:
                e_ino['links'].append(path)
    return entity_map
I've produced an example tree that looks like this:
$ tree --inodes
.
├── [ 67687] 4 -> 5
├── [ 67676] 5 -> 4
├── [ 67675] 6 -> dead
├── [ 67676] a
│   └── [ 67679] 1
├── [ 67677] b
│   └── [ 67679] 2 -> ../a/1
├── [ 67678] c
│   └── [ 67679] 3
└── [ 67687] d
    └── [ 67688] 4
4 directories, 7 files
The output of this function is:
$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}
If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...
By subsequently searching for the path we are interested in, we can find all other references within this tree:
def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info
$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}
The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding entry under 'links'... that's an exercise for the reader.
It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely, and there's even a Python module for it.
#!/usr/bin/env python3
import errno, os, sys, stat
from pprint import pprint

def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - e.g. dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entity_map[2049][4756]['links'][0]    path to a hard link for inode 4756
    #      entity_map[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
    entity_map = {}
    for root, dirs, files in os.walk(scan_root):
        # paths are recorded relative to scan_root, so this only works
        # when the current working directory is scan_root
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                # stat() follows symlinks, so this describes the target
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == errno.ENOENT:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == errno.ELOOP:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise
            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino
            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]
            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]
            # islink() uses lstat(), i.e. it inspects the path itself
            if os.path.islink(path):
                e_ino['symlinks'].append(path)
            else:
                e_ino['links'].append(path)
    return entity_map

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

entity_map = get_map(os.getcwd())
if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)
I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...
Some benchmarks to consider:
- A dataset of ~3.1k files and links in ~850 directories.
  This runs in less than 3.5 seconds, ~80ms on subsequent runs
- A dataset of ~30k files and links in ~2.2k directories.
  This runs in less than 30 seconds, ~300ms on subsequent runs
- A dataset of ~73.5k files and links in ~8k directories.
  This runs in approx 60 seconds, ~800ms on subsequent runs
Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled. I don't think that stat() is as slow as you think it is!
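The two rates quoted above can be reproduced by pooling the three datasets, which appears to be how they were derived. The sketch below treats the "less than" and "approx" timings as exact and counts one stat() call per file or link; both are simplifying assumptions.

```python
# (files and links, cold-cache seconds, warm-cache seconds) for each dataset
runs = [
    (3_100, 3.5, 0.08),
    (30_000, 30.0, 0.3),
    (73_500, 60.0, 0.8),
]

total_entries = sum(entries for entries, _, _ in runs)
cold_seconds = sum(cold for _, cold, _ in runs)
warm_seconds = sum(warm for _, _, warm in runs)

cold_rate = total_entries / cold_seconds   # roughly 1140 per second
warm_rate = total_entries / warm_seconds   # roughly 90k per second

print(f"cold cache: {cold_rate:.0f}/s, warm cache: {warm_rate:.0f}/s")
```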
How do I access in Python the link text of a symlink?
You can use os.readlink(path) to read the target of a symbolic link. From the documentation:
Return a string representing the path to which the symbolic link points.
Also check out this more detailed answer in a related SO question: https://stackoverflow.com/a/42426912/3628578
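A minimal sketch of os.readlink() in use, with placeholder file names in a temporary directory. The point it illustrates is that the returned text is exactly what was stored in the link, so a relative target has to be joined with the link's own directory before it can be used as a path.

```python
import os
import tempfile

base = tempfile.mkdtemp()
target = os.path.join(base, "target")
link = os.path.join(base, "link")

open(target, "w").close()
os.symlink("target", link)   # store a relative link text

# readlink() returns the stored text verbatim; it is not made absolute.
link_text = os.readlink(link)
print(link_text)

# A relative target is interpreted relative to the directory holding the link.
target_path = os.path.join(os.path.dirname(link), link_text)
print(os.path.samefile(target_path, target))
```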
How to see full absolute path of a symlink
realpath isn't available on all Linux flavors, but readlink should be.
readlink -f symlinkName
The above should do the trick.
Alternatively, if you don't have either of the above installed, you can do the following if you have Python 2.6 (or later) installed:
python -c 'import os.path; print(os.path.realpath("symlinkName"))'
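To show what os.path.realpath() adds over reading the link text, here is a small sketch with a chain of two symlinks (the names first and second are made up for the example). os.readlink() reports only the next hop, while realpath() follows the whole chain.

```python
import os
import tempfile

# realpath() on the temp dir keeps the comparison below valid even when the
# temp location is itself reached through a symlink.
base = os.path.realpath(tempfile.mkdtemp())
target = os.path.join(base, "target")
first = os.path.join(base, "first")
second = os.path.join(base, "second")

open(target, "w").close()
os.symlink("target", first)   # first  -> target
os.symlink("first", second)   # second -> first -> target

one_hop = os.readlink(second)        # just the next link in the chain
full_path = os.path.realpath(second)  # absolute path of the final target

print(one_hop)
print(full_path)
```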