How to Get the Filename Along with Absolute Path to the File, Whenever a New File Is Created Using Inode in Linux

how to get the filename along with absolute path to the file, whenever a new file is created using inode in linux?

It seems that you want to monitor a directory for changes. This is a complex job, but for which there are good modules. The easiest one to recommend is probably Linux::Inotify2

This module implements an interface to the Linux 2.6.13 and later Inotify file/directory change notification system.

This seems to be along the lines of what you wanted.

Any such monitor needs additional event handling. This example uses AnyEvent.

use warnings;
use strict;
use feature 'say';

use AnyEvent;
use Linux::Inotify2;

my $dir = 'dir_to_watch';

my $inotify = Linux::Inotify2->new or die "Can't create inotify object: $!";

$inotify->watch( $dir, IN_MODIFY | IN_CREATE, sub {
my $e = shift;
my $name = $e->fullname;
say "$name modified" if $e->IN_MODIFY; # Both show the new file
say "$name created" if $e->IN_CREATE; # but see comments below
});

my $inotify_w = AnyEvent->io (
fh => $inotify->fileno, poll => 'r', cb => sub { $inotify->poll }
);

1 while $inotify->poll;

If you only care about new files then you only need one constant above.
For both types of events the $name has the name of the new file. From man inotify on my system

... the name field in the returned inotify_event structure identifies the name of the file within the directory.

The inotify_event structure is suitably represented by a Linux::Inotify2::Watcher object.

Using IN_CREATE seems to be an obvious solution for your purpose. I tested by creating two files, with two redirected echo commands separated by semi-colon on the same command line, and also by touch-ing a file. The written files are detected as separate events, and so is the touch-ed file.

Using IN_MODIFY may also work since it monitors (in $dir)

... any file system object in the watched object (always a directory), that is files, directories, symlinks, device nodes etc. ...

As for tests, both files written by echo as above are reported, as separate events. But a touch-ed file is not reported, since data didn't change (the file wasn't written to).

Which is better suited for your need depends on details. For example, a tool may open a log file as it starts, only to write to it much later. The two ways above will behave differently in that case. All this should be investigated carefully under your specific conditions.

We may think of a race condition, since while the code executes other file(s) could slip in. But the module is far better than that and it does report new changes after the handler completes. I tested by creating files while that code runs (and sleeps) and they are reported.

Some other notable frameworks for event-driven programming are POE and IO::Async.

The File::Monitor does this kind of work, too.

How to get full path of a file?

Use readlink:

readlink -f file.txt

Unix File System: How are file names translated to disk sectors?

In addition of tmp's answer:

Files really are inodes.

Usually, a given file has some entry in some directory pointing to its inode. Directories are mapping names to inodes, and directories are a kind of file. See stat(2) for what an inode contains (and can be queried by application code), in particular for the various file types (plain file, directory, char or block device, fifo, symlink, ....). So a directory is often a dictionnary (implemented in various, file system specific, ways) mapping strings to inodes. So in the directory /bin/ there is usually an entry associating bash to the inode for the ELF executable of the bash shell (i.e. /bin/bash). Use readdir(3) -which in turn calls getdents(2)- to read the entries in a directory.

It may happen that a given inode is not any more reachable by some name. This happens in particular when a process is open(2)-ing a file, then unlink(2)-ing it (while keep the opened file descriptor). This is the preferred way to make temporary files. (They will be released by the kernel when no more processes use them).

It may also happen that a given inode has several directories entries pointing to it. (i.e. a file has "several names") This happens with link(2) syscall. (symlink files are created with symlink(2)).

See also path_resolution(7) and unix file system & file system & ext2 & ext3 & ext4 & btrfs & comparison of file systems wikipages. Read also this old file system description (some details are rotten, but the general idea is here, notably the role of VFS)

Copy and move's command effect on inode

You are correct in stating that cp will modify the file instead of deleting and recreating.

Here is a view of the underlying system calls as seen by strace (part of the output of strace cp file1 file2):

open("file2", O_WRONLY|O_TRUNC)         = 4
stat("file2", {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
stat("file1", {st_mode=S_IFREG|0664, st_size=3, ...}) = 0
stat("file2", {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
open("file1", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=3, ...}) = 0
open("file2", O_WRONLY|O_TRUNC) = 4
fstat(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "hi\n", 65536) = 3
write(4, "hi\n", 3) = 3
read(3, "", 65536) = 0
close(4) = 0
close(3) = 0

As you can see, it detects that file2 is present (stat returns 0), but then opens it for writing (O_WRONLY|O_TRUNC) without first doing an unlink.

See for example POSIX.1-2017, which specifies that the destination file shall only be unlink-ed where it could not be opened for writing and -f is used:

A file descriptor for dest_file shall be obtained by performing
actions equivalent to the open() function defined in the System
Interfaces volume of POSIX.1-2017 called using dest_file as the path
argument, and the bitwise-inclusive OR of O_WRONLY and O_TRUNC as the
oflag argument.

If the attempt to obtain a file descriptor fails and the -f option is
in effect, cp shall attempt to remove the file by performing actions
equivalent to the unlink() function defined in the System Interfaces
volume of POSIX.1-2017 called using dest_file as the path argument. If
this attempt succeeds, cp shall continue with step 3b.

This implies that if the destination file exists, the copy will succeed (without resorting to -f behaviour) if the cp process has write permission on it (not necessarily run as the user that owns the file), even if it does not have write permission on the containing directory. By contrast, unlinking and recreating would require write permission on the directory. I would speculate that this is behind the reason why the standard is as it is.

The --remove-destination option on GNU cp will make it do instead what you thought ought to be the default.

Here is the relevant part of the output of strace cp --remove-destination file1 file2. Note the unlink this time.

stat("file2", {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
stat("file1", {st_mode=S_IFREG|0664, st_size=3, ...}) = 0
lstat("file2", {st_mode=S_IFREG|0664, st_size=6, ...}) = 0
unlink("file2") = 0
open("file1", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=3, ...}) = 0
open("file2", O_WRONLY|O_CREAT|O_EXCL, 0664) = 4
fstat(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "hi\n", 65536) = 3
write(4, "hi\n", 3) = 3
read(3, "", 65536) = 0
close(4) = 0
close(3) = 0

When you use mv and the source and destination paths are on the same file filesystem, it will do an rename, and this will have the effect of unlinking any existing file at the target path. Here is the relevant part of the output of strace mv file1 file2.

access("file2", W_OK)                   = 0
rename("file1", "file2") = 0

In either case where an destination path is unlinked (whether explicitly by unlink() as called from cp --remove-destination, or as part of the effect of rename() as called from mv), the link count of the inode to which it was pointing will be decremented, but it will remain on the filesystem if either the link count is still >0 or if any processes have open filehandles on it. Any other (hard) links to this inode (i.e. other directory entries for it) will remain.

Investigating using ls -i

ls -i will show the inode numbers (as the first column when combined with -l), which helps demonstrate what is happening.

Example with default cp action

$ rm file1 file2 file3 

$ echo hi > file1
$ echo world > file2
$ ln file2 file3

$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 10:43 file1
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 10:43 file2
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 10:43 file3

$ cp file1 file2
$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 10:43 file1
50 -rw-rw-r-- 2 myuser mygroup 3 Jun 13 10:43 file2 <=== exsting inode
50 -rw-rw-r-- 2 myuser mygroup 3 Jun 13 10:43 file3 <=== exsting inode

(Note existing inode 50 now has size 3).

Example with --remove-destination

$ rm file1 file2 file3
$ echo hi > file1
$ echo world > file2
$ ln file2 file3

$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 10:46 file1
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 10:46 file2
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 10:46 file3

$ cp --remove-destination file1 file2
$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 10:46 file1
55 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 10:47 file2 <=== new inode
50 -rw-rw-r-- 1 myuser mygroup 6 Jun 13 10:46 file3 <=== existing inode

(Note new inode 55 has size 3. Unmodified inode 50 still has size 6.)

Example with mv

$ rm file1 file2 file3
$ echo hi > file1
$ echo world > file2
$ ln file2 file3

$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 11:05 file1
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 11:05 file2
50 -rw-rw-r-- 2 myuser mygroup 6 Jun 13 11:05 file3

$ mv file1 file2
$ ls -li file*
49 -rw-rw-r-- 1 myuser mygroup 3 Jun 13 11:05 file2 <== existing inode
50 -rw-rw-r-- 1 myuser mygroup 6 Jun 13 11:05 file3 <== existing inode

C - size of a file from an absolute path

stat(2) allows you to specify a file using it's full path name from the root directory (if the path starts with a /) or a relative one (otherwise, see path_resolution(7)), from your current directory. Even if you don't know any of the names (in case yo have only an open file descriptor, and you don't know the name of this file) you have the possibility of making a fstat(2) system call.

Be careful with the fact that the file size can change between the stat(2) call you make to know the file size and whatever you could do after. Think on opening with O_APPEND flag if you want to assure your write will not be interspesed with others'.

Get file name from absolute path in Nodejs?

Use the basename method of the path module:

path.basename('/foo/bar/baz/asdf/quux.html')
// returns
'quux.html'

Here is the documentation the above example is taken from.

Determine a file's path(s) relative to a directory, including symlinks

This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.

It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.


Before we get into this... some comments:

  1. See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so using a lower spec system will take longer...
  2. Given that this is impossible without calling stat() on every file at some point, you're going to struggle coming up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces)

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us:

def get_map(scan_root):
# this dict will have device IDs at the first level (major / minor) ...
# ... and inodes IDs at the second level
# each inode will have the following keys:
# - 'type' the entity's type - i.e: dir, file, socket, etc...
# - 'links' a list of all found hard links to the inode
# - 'symlinks' a list of all found symlinks to the inode
# e.g: entities[2049][4756]['links'][0] path to a hard link for inode 4756
# entities[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
entity_map = {}

for root, dirs, files in os.walk(scan_root):
root = '.' + root[len(scan_root):]
for path in [ os.path.join(root, _) for _ in files ]:
try:
p_stat = os.stat(path)
except OSError as e:
if e.errno == 2:
print('Broken symlink [%s]... skipping' % ( path ))
continue
if e.errno == 40:
print('Too many levels of symbolic links [%s]... skipping' % ( path ))
continue
raise

p_dev = p_stat.st_dev
p_ino = p_stat.st_ino

if p_dev not in entity_map:
entity_map[p_dev] = {}
e_dev = entity_map[p_dev]

if p_ino not in e_dev:
e_dev[p_ino] = {
'type': get_type(p_stat.st_mode),
'links': [],
'symlinks': [],
}
e_ino = e_dev[p_ino]

if os.lstat(path).st_ino == p_ino:
e_ino['links'].append(path)
else:
e_ino['symlinks'].append(path)

return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [ 67687] 4 -> 5
├── [ 67676] 5 -> 4
├── [ 67675] 6 -> dead
├── [ 67676] a
│   └── [ 67679] 1
├── [ 67677] b
│   └── [ 67679] 2 -> ../a/1
├── [ 67678] c
│   └── [ 67679] 3
└── [ 67687] d
└── [ 67688] 4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
'symlinks': ['./b/2'],
'type': 'file'},
67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
for dev, inodes in entity_map.items():
for inode, info in inodes.items():
if filename in info['links'] or filename in info['symlinks']:
return info
$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.

It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely - there's even a python module.

#!/usr/bin/env python3

import os, sys, stat
from pprint import pprint

def get_type(mode):
if stat.S_ISDIR(mode):
return 'directory'
if stat.S_ISCHR(mode):
return 'character'
if stat.S_ISBLK(mode):
return 'block'
if stat.S_ISREG(mode):
return 'file'
if stat.S_ISFIFO(mode):
return 'fifo'
if stat.S_ISLNK(mode):
return 'symlink'
if stat.S_ISSOCK(mode):
return 'socket'
return 'unknown'

def get_map(scan_root):
# this dict will have device IDs at the first level (major / minor) ...
# ... and inodes IDs at the second level
# each inode will have the following keys:
# - 'type' the entity's type - i.e: dir, file, socket, etc...
# - 'links' a list of all found hard links to the inode
# - 'symlinks' a list of all found symlinks to the inode
# e.g: entities[2049][4756]['links'][0] path to a hard link for inode 4756
# entities[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
entity_map = {}

for root, dirs, files in os.walk(scan_root):
root = '.' + root[len(scan_root):]
for path in [ os.path.join(root, _) for _ in files ]:
try:
p_stat = os.stat(path)
except OSError as e:
if e.errno == 2:
print('Broken symlink [%s]... skipping' % ( path ))
continue
if e.errno == 40:
print('Too many levels of symbolic links [%s]... skipping' % ( path ))
continue
raise

p_dev = p_stat.st_dev
p_ino = p_stat.st_ino

if p_dev not in entity_map:
entity_map[p_dev] = {}
e_dev = entity_map[p_dev]

if p_ino not in e_dev:
e_dev[p_ino] = {
'type': get_type(p_stat.st_mode),
'links': [],
'symlinks': [],
}
e_ino = e_dev[p_ino]

if os.lstat(path).st_ino == p_ino:
e_ino['links'].append(path)
else:
e_ino['symlinks'].append(path)

return entity_map

def filter_map(entity_map, filename):
for dev, inodes in entity_map.items():
for inode, info in inodes.items():
if filename in info['links'] or filename in info['symlinks']:
return info

entity_map = get_map(os.getcwd())

if len(sys.argv) == 2:
entity_info = filter_map(entity_map, sys.argv[1])
pprint(entity_info)
else:
pprint(entity_map)

I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

  • A dataset of ~3.1k files and links in ~850 directories.
    This runs in less than 3.5 seconds, ~80ms on subsequent runs
  • A dataset of ~30k files and links in ~2.2k directories.
    This runs in less than 30 seconds, ~300ms on subsequent runs
  • A dataset of ~73.5k files and links in ~8k directories.
    This runs in approx 60 seconds, ~800ms on subsequent runs

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!

Retrieve filename from file descriptor in C

You can use readlink on /proc/self/fd/NNN where NNN is the file descriptor. This will give you the name of the file as it was when it was opened — however, if the file was moved or deleted since then, it may no longer be accurate (although Linux can track renames in some cases). To verify, stat the filename given and fstat the fd you have, and make sure st_dev and st_ino are the same.

Of course, not all file descriptors refer to files, and for those you'll see some odd text strings, such as pipe:[1538488]. Since all of the real filenames will be absolute paths, you can determine which these are easily enough. Further, as others have noted, files can have multiple hardlinks pointing to them - this will only report the one it was opened with. If you want to find all names for a given file, you'll just have to traverse the entire filesystem.



Related Topics



Leave a reply



Submit