How to Find the Target File's Full(Absolute Path) of the Symbolic Link or Soft Link in Python

Python pathlib: Resolve full path to symbolic link without following it

pathlib.Path has an absolute method that does what you want.

$  mkdir folder
$  touch target
$  ln -s ~/target ~/folder/link
$  ls -l folder/
total 0
lrwxrwxrwx 1 me users 16 Feb 20 19:47 link -> /home/me/target
$  cd folder

$/folder  python3.7 -c 'import os.path;print(os.path.abspath("link"))'
/home/me/folder/link

$/folder  python3.7 -c 'import pathlib;p = pathlib.Path("link");print(p.absolute())'
/home/me/folder/link

The method doesn't appear in the module documentation, but its docstring reads:

Return an absolute version of this path. This function works
even if the path doesn't point to anything.
No normalization is done, i.e. all '.' and '..' will be kept along.
Use resolve() to get the canonical path to a file.

It's worth noting that there are comments in the method code (in the 3.7 branch) that suggest it may not be fully tested on all platforms.

Determine a file's path(s) relative to a directory, including symlinks

This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.

It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.

Before we get into this... some comments:

See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so using a lower spec system will take longer...
Given that this is impossible without calling stat() on every file at some point, you're going to struggle coming up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces)

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us:

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [  67687]  4 -> 5
├── [  67676]  5 -> 4
├── [  67675]  6 -> dead
├── [  67676]  a
│   └── [  67679]  1
├── [  67677]  b
│   └── [  67679]  2 -> ../a/1
├── [  67678]  c
│   └── [  67679]  3
└── [  67687]  d
    └── [  67688]  4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.

It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely - there's even a python module.

#!/usr/bin/env python3

import os, sys, stat
from pprint import pprint

def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inodes IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

entity_map = get_map(os.getcwd())

if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)

I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

A dataset of ~3.1k files and links in ~850 directories.
This runs in less than 3.5 seconds, ~80ms on subsequent runs
A dataset of ~30k files and links in ~2.2k directories.
This runs in less than 30 seconds, ~300ms on subsequent runs
A dataset of ~73.5k files and links in ~8k directories.
This runs in approx 60 seconds, ~800ms on subsequent runs

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!

How do I access in Python the link text of a symlink?

You can use os.readlink(path) to find the linked file of a symbolic link. From the documentation:

Return a string representing the path to which the symbolic link points.

Also check out this more detailed answer in a related SO question: https://stackoverflow.com/a/42426912/3628578

How to see full absolute path of a symlink

realpath isn't available on all linux flavors, but readlink should be.

readlink -f symlinkName

The above should do the trick.

Alternatively, if you don't have either of the above installed, you can do the following if you have python 2.6 (or later) installed

python -c 'import os.path; print(os.path.realpath("symlinkName"))'

How to Find the Target File's Full(Absolute Path) of the Symbolic Link or Soft Link in Python

Python pathlib: Resolve full path to symbolic link without following it

Determine a file's path(s) relative to a directory, including symlinks

How do I access in Python the link text of a symlink?

How to see full absolute path of a symlink

Related Topics

Leave a reply