Linux: Find All Symlinks of a Given 'Original' File? (Reverse 'Readlink')

I've not seen a command for this, and it's not an easy task, since the target file contains no information about which links point to it.

This is similar to "hard" links, but at least those are always on the same file system, so you can use find -inum to list them. Soft links are more problematic since they can cross file systems.

I think what you're going to have to do is basically perform an ls -al on every file in your entire hierarchy and use grep to search for -> /path/to/target/file.

For example, here's one I ran on my system (formatted for readability - those last two lines are actually on one line in the real output):

pax$ find / -exec ls -ald {} ';' 2>/dev/null | grep '\-> /usr/share/applications'
lrwxrwxrwx 1 pax pax 23 2010-06-12 14:56 /home/pax/applications_usr_share
                                         -> /usr/share/applications
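
The same brute-force scan can be sketched in Python. This is only a minimal sketch under a few assumptions: the function name find_symlinks_to is made up for illustration, and it compares fully resolved paths instead of grepping the ls output, so relative links to the target are reported as well.

```python
import os

def find_symlinks_to(target, scan_root):
    """Yield every symlink under scan_root that resolves to target."""
    wanted = os.path.realpath(target)
    for dirpath, dirnames, filenames in os.walk(scan_root):
        # os.walk puts symlinks to directories in dirnames and does not
        # descend into them by default, so both lists are checked.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and os.path.realpath(path) == wanted:
                yield path
```

Calling list(find_symlinks_to('/usr/share/applications', '/home')) would scan only /home; scanning / works too but takes as long as the find command above.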

symbolic link: find all files that link to this file

It depends. If you are trying to find links to a specific file called foo.txt, then this is the only good way:

find -L / -samefile path/to/foo.txt

On the other hand, if you are just trying to find links to any file that happens to be named foo.txt, then something like one of the following will work:

find / -lname foo.txt

find . -lname \*foo.txt # ignore leading pathname components
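
For comparison, here is a rough Python sketch of what the -lname test does: it matches the text stored inside each symlink against a shell pattern, without resolving anything. The function name find_links_by_name is hypothetical, and fnmatch is assumed to be close enough to find's pattern matching for this purpose.

```python
import fnmatch
import os

def find_links_by_name(scan_root, pattern):
    """Return symlinks under scan_root whose stored target text matches pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(scan_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.readlink returns the link text exactly as it was stored,
            # so a relative link such as ../dir/foo.txt is matched as written.
            if os.path.islink(path) and fnmatch.fnmatchcase(os.readlink(path), pattern):
                matches.append(path)
    return matches
```

With the pattern 'foo.txt' only links whose text is exactly foo.txt match; '*foo.txt' also matches links that store a leading directory part.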

How to resolve symbolic links in a shell script

According to the standards, pwd -P should return the path with symlinks resolved.

C function char *getcwd(char *buf, size_t size) from unistd.h should have the same behaviour.
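
As a small illustration, Python's os.getcwd() wraps the same call, so it can be used to observe the behaviour. This is a sketch only; the helper name cwd_via is invented for the example.

```python
import os

def cwd_via(path):
    """chdir into path, report os.getcwd(), then restore the previous directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        # getcwd reports the physical directory, even if path was a symlink
        return os.getcwd()
    finally:
        os.chdir(previous)
```

If link is a symlink to the directory real, cwd_via('link') returns the path of real. For an arbitrary path rather than the working directory, os.path.realpath(path) does the resolving.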

Determine a file's path(s) relative to a directory, including symlinks

This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.

It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
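
To make the distinction concrete, here is a minimal sketch that classifies how one path relates to another. The name link_kind and its three return values are made up for illustration; it relies only on os.stat (follows symlinks) and os.lstat (does not).

```python
import os

def link_kind(path, original):
    """Classify how path relates to original: 'symlink', 'hard link' or 'unrelated'."""
    if os.path.islink(path):
        try:
            # samefile follows symlinks and compares device and inode numbers
            return "symlink" if os.path.samefile(path, original) else "unrelated"
        except OSError:  # dangling link or symlink loop
            return "unrelated"
    own, orig = os.lstat(path), os.stat(original)
    # A hard link is just another name for the same inode on the same device
    same = (own.st_dev, own.st_ino) == (orig.st_dev, orig.st_ino)
    return "hard link" if same else "unrelated"
```

Note that this only answers the question for one candidate path at a time; finding every candidate still requires walking the tree.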

Before we get into this... some comments:

See the end for some benchmarks. I'm not convinced that the cost of a full scan is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so a lower-spec system will take longer...
Given that this is impossible without calling stat() on every file at some point, you're going to struggle to come up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces).

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us:

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - e.g. dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entity_map[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entity_map[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:  # ENOENT: the symlink's target does not exist
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:  # ELOOP (40 on Linux): symlink loop
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [  67687]  4 -> 5
├── [  67676]  5 -> 4
├── [  67675]  6 -> dead
├── [  67676]  a
│   └── [  67679]  1
├── [  67677]  b
│   └── [  67679]  2 -> ../a/1
├── [  67678]  c
│   └── [  67679]  3
└── [  67687]  d
    └── [  67688]  4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.

It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely - there's even a python module.

#!/usr/bin/env python3

import os, sys, stat
from pprint import pprint

def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - e.g. dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entity_map[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entity_map[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}

    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:  # ENOENT: the symlink's target does not exist
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:  # ELOOP (40 on Linux): symlink loop
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise

            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino

            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]

            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]

            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)

    return entity_map

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info

entity_map = get_map(os.getcwd())

if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)

I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

A dataset of ~3.1k files and links in ~850 directories: runs in less than 3.5 seconds, ~80ms on subsequent runs.
A dataset of ~30k files and links in ~2.2k directories: runs in less than 30 seconds, ~300ms on subsequent runs.
A dataset of ~73.5k files and links in ~8k directories: runs in approx 60 seconds, ~800ms on subsequent runs.

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!
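
If you want to get comparable numbers for your own system, something like the following sketch could be used. The function name time_stat_calls is made up, and it times os.lstat() over a whole tree rather than reproducing the exact measurements above.

```python
import os
import time

def time_stat_calls(scan_root):
    """Call os.lstat() on every entry under scan_root; return (calls, seconds)."""
    calls = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(scan_root):
        for name in dirnames + filenames:
            os.lstat(os.path.join(dirpath, name))
            calls += 1
    return calls, time.perf_counter() - start
```

Dividing calls by seconds gives a calls-per-second figure; running it twice in a row shows the difference between a cold and a warm cache.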

How do I know the script file name in a Bash script?

me=$(basename "$0")

For reading through a symlink¹, which is usually not what you want (you usually don't want to confuse the user this way), try:

me="$(basename "$(test -L "$0" && readlink "$0" || echo "$0")")"

IMO, that'll produce confusing output. "I ran foo.sh, but it's saying I'm running bar.sh!? Must be a bug!" Besides, one of the purposes of having differently-named symlinks is to provide different functionality based on the name it's called as (think gzip and gunzip on some platforms).

¹ That is, to resolve symlinks such that when the user executes foo.sh which is actually a symlink to bar.sh, you wish to use the resolved name bar.sh rather than foo.sh.
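
The same choice exists in other languages. As a rough Python sketch (the function name script_name and its follow_symlink flag are invented for illustration), the two variants look like this:

```python
import os
import sys

def script_name(invoked_as, follow_symlink=False):
    """Return the base name of a script path, optionally reading through one symlink."""
    if follow_symlink and os.path.islink(invoked_as):
        # Like readlink without -f: only one level of link is read
        invoked_as = os.readlink(invoked_as)
    return os.path.basename(invoked_as)
```

In a script you would call script_name(sys.argv[0]). If foo.sh is a symlink to bar.sh, the default gives foo.sh and follow_symlink=True gives bar.sh.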

How do I get the directory where a Bash script is located from within the script itself?

#!/usr/bin/env bash

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

is a useful one-liner which will give you the full directory name of the script no matter where it is being called from.

It will work as long as the last component of the path used to find the script is not a symlink (directory links are OK). If you also want to resolve any links to the script itself, you need a multi-line solution:

#!/usr/bin/env bash

SOURCE=${BASH_SOURCE[0]}
while [ -L "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )
  SOURCE=$(readlink "$SOURCE")
  [[ $SOURCE != /* ]] && SOURCE=$DIR/$SOURCE # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )

This last one will work with any combination of aliases, source, bash -c, symlinks, etc.

Beware: if you cd to a different directory before running this snippet, the result may be incorrect!

Also, watch out for $CDPATH gotchas, and stderr output side effects if the user has smartly overridden cd to redirect output to stderr instead (including escape sequences, such as when calling update_terminal_cwd >&2 on Mac). Adding >/dev/null 2>&1 at the end of your cd command will take care of both possibilities.

To understand how it works, try running this more verbose form:

#!/usr/bin/env bash

SOURCE=${BASH_SOURCE[0]}
while [ -L "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  TARGET=$(readlink "$SOURCE")
  if [[ $TARGET == /* ]]; then
    echo "SOURCE '$SOURCE' is an absolute symlink to '$TARGET'"
    SOURCE=$TARGET
  else
    DIR=$( dirname "$SOURCE" )
    echo "SOURCE '$SOURCE' is a relative symlink to '$TARGET' (relative to '$DIR')"
    SOURCE=$DIR/$TARGET # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
  fi
done
echo "SOURCE is '$SOURCE'"
RDIR=$( dirname "$SOURCE" )
DIR=$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )
if [ "$DIR" != "$RDIR" ]; then
  echo "DIR '$RDIR' resolves to '$DIR'"
fi
echo "DIR is '$DIR'"

And it will print something like:

SOURCE './scriptdir.sh' is a relative symlink to 'sym2/scriptdir.sh' (relative to '.')
SOURCE is './sym2/scriptdir.sh'
DIR './sym2' resolves to '/home/ubuntu/dotfiles/fo fo/real/real1/real2'
DIR is '/home/ubuntu/dotfiles/fo fo/real/real1/real2'
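
For readers who find the loop easier to follow outside of shell syntax, here is a minimal Python sketch of the same steps. The function name script_dir is hypothetical, and like the Bash version it assumes the chain of symlinks does not form a loop.

```python
import os

def script_dir(source):
    """Follow symlinks at the final path component, then return the physical directory."""
    while os.path.islink(source):
        link_dir = os.path.dirname(source)
        target = os.readlink(source)
        # A relative target is interpreted relative to the directory holding the link
        source = target if os.path.isabs(target) else os.path.join(link_dir, target)
    # realpath resolves symlinked directories, like `cd -P ... && pwd`
    return os.path.realpath(os.path.dirname(source) or ".")
```

In practice os.path.dirname(os.path.realpath(path)) gives the same result in one call; the loop is spelled out here only to mirror the shell code.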