What Determines The Order Directory Entries Are Returned by Getdents

What determines the order directory entries are returned by getdents?

If you really are determined to change this program's behaviour (and I assume you don't have its source code available), you can use LD_PRELOAD to hook the calls to opendir and readdir and replace them with your own sorting wrapper. An example of what such a hook could look like follows:

#define _GNU_SOURCE 1
#include <stdio.h>
#include <dirent.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

/* A mirror of glibc's private DIR layout (struct __dirstream).
   It is an internal structure that can change between glibc
   versions, so this hack is inherently fragile. */
struct __dirstream
{
    int __fd;
    char *__data;
    size_t __allocation;
    size_t __offset;
    size_t __size;
    struct dirent __entry;
};

typedef struct _dirent_list {
    struct dirent *value;
    struct _dirent_list *next;
} dirent_list;

typedef struct _my_DIR {
    struct __dirstream orig;   /* must come first so the casts work */
    dirent_list *first_entry;
    int first_readdir;
} my_DIR;

DIR *opendir(const char *name) {
    DIR *(*orig_opendir)(const char *) = dlsym(RTLD_NEXT, "opendir");
    DIR *dir = orig_opendir(name);
    if (!dir)
        return NULL;

    /* Save additional information along with the original DIR
       structure. (The DIR object returned by the real opendir is
       leaked here; plugging that is left out of this sketch.) */
    my_DIR *my_dir = calloc(1, sizeof(*my_dir));
    memcpy(my_dir, dir, sizeof(*dir));
    my_dir->first_readdir = 1;
    return (DIR *)my_dir;
}

struct dirent *readdir(DIR *dir) {
    struct dirent *(*orig_readdir)(DIR *) = dlsym(RTLD_NEXT, "readdir");
    my_DIR *my_dir = (my_DIR *)dir;
    dirent_list *item;

    /* On the first call, slurp the whole directory into a list. */
    if (my_dir->first_readdir) {
        struct dirent *entry;
        while ((entry = orig_readdir(dir))) {
            /* Caveat: these pointers point into the stream's buffer
               and may be invalidated once it is refilled, so very
               large directories would need a deep copy of each entry.
               Exercise for the reader: implement insertion sort here. */
            item = calloc(1, sizeof(*item));
            item->value = entry;
            item->next = my_dir->first_entry;
            my_dir->first_entry = item;
        }
        my_dir->first_readdir = 0;
    }

    if (!my_dir->first_entry)
        return NULL;

    /* Pop the head of the list; prepending on insert plus popping
       here is what produces the reversed order. */
    item = my_dir->first_entry;
    struct dirent *result = item->value;
    my_dir->first_entry = item->next;
    free(item);

    return result;
}

It overrides opendir and readdir to return the entries in reverse order (you can adapt this to sort them instead). This is how you use it with a test program, test, that simply lists the directory entries in the order they are received:
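
The test program itself isn't shown in the answer; a minimal reconstruction of test.c, which just prints entries in the order readdir hands them out, might look like this:

#include <stdio.h>
#include <dirent.h>

int main(void) {
    DIR *dir = opendir(".");
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)))
        printf("%s\n", entry->d_name);

    closedir(dir);
    return 0;
}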

$ gcc -Wall -shared -fPIC -o libhookdir.so hookdir.c -ldl
$ ./test
..
test
.
hookdir.c
libhookdir.so
test.c
$ LD_PRELOAD=./libhookdir.so ./test
test.c
libhookdir.so
hookdir.c
.
test
..

Hah! This works. We just hooked a libc function.

How are dirent entries ordered?

They are not ordered in any relevant way. It's up to the implementation to retrieve and return directory entries in whatever order is most convenient.

Advanced Programming in the UNIX Environment, 3rd ed., goes a little further and even says that the order is usually not alphabetical (Chapter 4, Section 4.22):

Note that the ordering of entries within the directory is
implementation dependent and is usually not alphabetical.

If you're wondering, the output of ls is sorted because ls sorts it. (With GNU ls, you can see the raw order with ls -U, which lists entries in directory order without sorting.)

Why does Linux use getdents() on directories instead of read()?

getdents returns struct linux_dirent records, and it does so for any underlying type of filesystem. The "on disk" format could be completely different, known only to the given filesystem driver, so a simple userspace read call could not work. That is, getdents may convert from the native format to fill in the linux_dirent records.
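
To make that concrete, here is a minimal sketch of driving getdents64 directly. It goes through syscall(2), since glibc historically provides no wrapper, and the record layout follows the getdents(2) man page; error handling is kept minimal:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent64 {          /* layout per getdents(2) */
    ino64_t        d_ino;        /* inode number */
    off64_t        d_off;        /* opaque offset cookie */
    unsigned short d_reclen;     /* length of this record */
    unsigned char  d_type;       /* file type, or DT_UNKNOWN */
    char           d_name[];     /* NUL-terminated filename */
};

int main(void) {
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    long n;
    /* each call fills buf with as many whole records as fit */
    while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
        for (long pos = 0; pos < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            printf("%s\n", d->d_name);
            pos += d->d_reclen;   /* records are variable length */
        }
    }
    if (n < 0)
        perror("getdents64");

    close(fd);
    return 0;
}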

Couldn't the same thing be said about reading bytes from a file with read()? The on-disk format of the data within a file isn't necessarily uniform across filesystems, or even contiguous on disk - thus, reading a series of bytes from disk would again be something I'd expect to be delegated to the filesystem driver.

The discontiguous file data is handled by the VFS ("virtual filesystem") layer, regardless of how an FS chooses to organize the block list for a file. For example, ext4 uses "inodes" ("index" or "information" nodes) with an "ISAM" ("indexed sequential access method") organization, whereas an MS/DOS FS has a completely different organization.

Each FS driver registers a table of VFS function callbacks when it starts up. For a given operation (e.g. open/close/read/write/seek), there is a corresponding entry in the table.

The VFS layer (i.e. from the userspace syscall) will "call down" into the FS driver and the FS driver will perform the operation, doing whatever it deems necessary to fulfill the request.
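
As a rough sketch of that mechanism: each driver fills in one table of callbacks, and the VFS dispatches every operation through the table the driver registered. The structure below is a simplified illustration, loosely modelled on the kernel's struct file_operations from <linux/fs.h>; the names and signatures here are not the real kernel API:

#include <stddef.h>
#include <sys/types.h>

struct fs_file;  /* opaque per-open-file state owned by the driver */

/* Hypothetical per-filesystem dispatch table: one instance filled
   in and registered by each FS driver, called into by the VFS. */
struct fs_operations {
    ssize_t (*read)(struct fs_file *, void *, size_t, off_t *);
    ssize_t (*write)(struct fs_file *, const void *, size_t, off_t *);
    off_t   (*seek)(struct fs_file *, off_t, int);
    long    (*getdents)(struct fs_file *, void *, size_t);
};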

I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented.

Yes. For example, if the read request is to read the first three blocks from the file (e.g. 0,1,2), the FS will look up the indexing information for the file and get a list of physical blocks to read (e.g. 1000000,200,37) from the disk surface. This is all handled transparently in the FS driver.

The userspace program will simply see its buffer filled up with the correct data, without regard to how complex the FS indexing and block fetch had to be.

Perhaps it is [loosely] more proper to refer to this as transferring inode data as there are inodes for files (i.e. an inode has the indexing information to "scatter/gather" the FS blocks for the file). But, the FS driver also uses this internally to read from a directory. That is, each directory has an inode to keep track of the indexing information for that directory.

So, to an FS driver, a directory is much like a flat file that has specially formatted information. These are the directory "entries". This is what getdents returns. This "sits on top of" the inode indexing layer.

Directory entries can be variable length [based on the length of the filename]. So, the on disk format would be (call it "Type A"):

static part|variable length name
static part|variable length name
...

But ... some FSes organize themselves differently (call it "Type B"):

<static1>,<static2>...
<variable1>,<variable2>,...

So, the Type A organization might be read atomically by a userspace read(2) call, but Type B would have difficulty. The getdents VFS call handles this.
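
For a concrete example of the Type A layout, ext4's classic on-disk directory entry looks roughly like this (simplified from struct ext4_dir_entry_2 in the kernel's fs/ext4/ext4.h):

#include <stdint.h>

/* A fixed header followed by a variable-length name. rec_len
   includes any padding, so a reader can hop from entry to entry
   without parsing the names. */
struct ext4_dir_entry_2 {
    uint32_t inode;      /* inode number; 0 marks an unused entry */
    uint16_t rec_len;    /* total length of this record */
    uint8_t  name_len;   /* bytes of name that follow */
    uint8_t  file_type;  /* type hint, analogous to dirent's d_type */
    char     name[];     /* not NUL-terminated on disk */
};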

Couldn't the VFS also present a "linux_dirent" view of a directory, like the VFS presents a "flat view" of a file?

That is what getdents is for.

Then again, I'm assuming that an FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory, rather than a series of bytes.

getdents did not always exist. When dirents were fixed size and there was only one FS format, the readdir(3) call probably did read(2) underneath and got a series of bytes [which is all that read(2) provides]. Actually, IIRC, in the beginning there was only readdir(2); getdents and readdir(3) did not exist.

But, what do you do if the read(2) is "short" (e.g. two bytes too small)? How do you communicate that to the app?

My question is more like: since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading linux_dirent records?

read on a dir isn't intercepted and converted to getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.

You do open(2) for files or dirs [opendir(3) is a wrapper and does open(2) underneath]. You can read/write/seek for files, and seek/getdents for dirs.

But ... doing read(2) on a directory returns EISDIR. [Side note: I had forgotten this in my original comments.] In the simple "flat data" model read provides, there isn't a way to convey/control all that getdents can and does.

So, rather than allow an inferior way to get partial/wrong info, it's simpler for the kernel and an app developer to go through the getdents interface.

Further, getdents does things atomically. If you're reading directory entries in a given program, there may be other programs that are creating and deleting files in that directory or renaming them--right in the middle of your getdents sequence.

getdents will present an atomic view. Either a file exists or it doesn't. It's been renamed or it hasn't. So, you don't get a "half modified" view, regardless of how much "turmoil" is happening around you. When you ask getdents for 20 entries, you'll get them [or 10 if there's only that much].

Side note: A useful trick is to "overspecify" the count. That is, tell getdents you want 50,000 entries [you must provide the space]. You'll usually get back something like 100 or so. But now what you've got is an atomic snapshot in time of the full directory. I sometimes do this instead of looping with a count of 1--YMMV. You still have to protect against immediate disappearance, but at least you can see it (i.e. a subsequent file open fails).
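
A sketch of that trick as a hypothetical helper (record layout as in getdents(2); note that the syscall's count argument is really a byte count, so "50,000 entries" turns into a generously sized buffer):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns a malloc'd snapshot of the whole directory taken in one
   getdents64 call (caller frees), or NULL on failure. 50,000
   entries at a guessed ~280 bytes each is deliberately generous. */
static char *snapshot_dir(int fd, long *out_bytes) {
    size_t cap = 50000 * 280;
    char *snap = malloc(cap);
    if (!snap)
        return NULL;
    long n = syscall(SYS_getdents64, fd, snap, cap);
    if (n < 0) {
        free(snap);
        return NULL;
    }
    *out_bytes = n;  /* whole records only: a point-in-time view */
    return snap;
}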

So, you always get "whole" entries, and no entry for a just-deleted file. That is not to say that the file is still there, merely that it was there at the time of the getdents. Another process may instantly erase it, but not in the middle of the getdents.

If read(2) were allowed, you'd have to guess at how much data to read, and you wouldn't know which entries were fully formed or in a partial state. If the FS had the Type B organization above, a single read could not atomically get the static portion and the variable portion in a single step.

It would be philosophically incorrect to slow down read(2) to do what getdents does.

getdents, unlink, creat, rmdir, and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about each other".

If pgmA renames "x" to "z" and pgmB renames "y" to "z", they don't collide. One goes first and another second but no FS blocks are ever lost/leaked. getdents gets the whole view (be it "x y", "y z", "x z" or "z"), but it will never see "x y z" simultaneously.

How to list first level directories only in C?

Unfortunately, all solutions based on shell expansion are limited by the maximum command line length. Which varies (run true | xargs --show-limits to find out); on my system, it is about two megabytes. Yes, many will argue that it suffices -- as did Bill Gates on 640 kilobytes, once.

(When running certain parallel simulations on non-shared filesystems, I do occasionally have tens of thousands of files in the same directory, during the collection phase. Yes, I could do that differently, but that happens to be the easiest and most robust way to collect the data. Very few POSIX utilities are actually silly enough to assume "X is sufficient for everybody".)

Fortunately, there are several solutions. One is to use find instead:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d");

You can also format the output as you wish, not depending on locale:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\n'");

If you want to sort the output, use \0 as the separator (since filenames are allowed to contain newlines), and pass -z to sort so it uses \0 as the separator, too. tr will convert them to newlines for you. Note the doubled backslashes: they ensure the shell command receives a literal \0 rather than a C string escape:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\0' | sort -z | tr -s '\\0' '\\n'");

If you want the names in an array, use the glob() function instead.
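
For example, here's a minimal glob() sketch. The trailing slash in the pattern restricts matches to directories, and GLOB_ONLYDIR is merely a GNU performance hint:

#include <glob.h>
#include <stdio.h>

int main(void) {
    glob_t g;
    if (glob("*/", GLOB_ONLYDIR, NULL, &g) == 0) {
        /* gl_pathv holds the matched names, sorted by default */
        for (size_t i = 0; i < g.gl_pathc; i++)
            printf("%s\n", g.gl_pathv[i]);
        globfree(&g);
    }
    return 0;
}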

Finally, as I like to harp every now and then, one can use the POSIX nftw() function to implement this internally:

#define _GNU_SOURCE
#include <stdio.h>
#include <ftw.h>

#define NUM_FDS 17

int myfunc(const char *path,
           const struct stat *fileinfo,
           int typeflag,
           struct FTW *ftwinfo)
{
    const char *file = path + ftwinfo->base;
    const int depth = ftwinfo->level;

    /* We are only interested in first-level directories.
       Note that depth==0 is the directory itself specified
       as a parameter. */
    if (depth != 1 || (typeflag != FTW_D && typeflag != FTW_DNR))
        return 0;

    /* Don't list names starting with a '.'. */
    if (file[0] != '.')
        printf("%s/\n", path);

    /* Do not recurse. */
    return FTW_SKIP_SUBTREE;
}

and the nftw() call to use the above is obviously something like

if (nftw(".", myfunc, NUM_FDS, FTW_ACTIONRETVAL)) {
    /* An error occurred. */
}

The only "issue" in using nftw() is to choose a good number of file descriptors the function may use (NUM_FDS). POSIX says a process must always be able to have at least 20 open file descriptors. If we subtract the standard ones (input, output, and error), that leaves 17. The above is unlikely to use more than 3, though.

You can find the actual limit using sysconf(_SC_OPEN_MAX), and subtracting the number of descriptors your process may use at the same time. In current Linux systems, it is typically limited to 1024 per process.
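
For example:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* sysconf returns -1 if the limit is indeterminate */
    long maxfd = sysconf(_SC_OPEN_MAX);
    printf("This process may have up to %ld open file descriptors.\n", maxfd);
    return 0;
}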

The good thing is, as long as that number is at least 4 or 5 or so, it only affects performance: it just determines how deep nftw() can go in the directory tree before it has to use workarounds.

If you want to create a test directory with lots of subdirectories, use something like the following Bash:

mkdir lots-of-subdirs
cd lots-of-subdirs
for ((i=0; i<100000; i++)); do mkdir directory-$i-has-a-long-name-since-command-line-length-is-limited ; done

On my system, running

ls -d */

in that directory yields a bash: /bin/ls: Argument list too long error, while the find command and the nftw()-based program run just fine.

You also cannot remove the directories using rmdir directory-*/ for the same reason. Use

find . -name 'directory-*' -type d -print0 | xargs -r0 rmdir

instead. Or just remove the entire directory and subdirectories,

cd ..
rm -rf lots-of-subdirs

Meaning of field d_off in last struct dirent

"SVr4" means Unix System V Release 4. Solaris is based on that, and Solaris says:

The d_off entry contains a value which is interpretable only by the filesystem that generated it. It may be supplied as an offset to lseek(2) to find the entry following the current one in a directory.

If you look at the example in the Linux manpage, you'll find a program that uses getdents. It doesn't rely on the d_off of the final entry, which is apparently indeterminate, but on the return value from getdents, to determine how many entries there are.
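
As a sketch of what that quote permits (a hypothetical helper, using the same direct-syscall style as that example): save an entry's d_off while iterating, and hand it back to lseek(2) later to resume at the following entry.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Resume a directory scan from a saved d_off cookie. The cookie is
   opaque; only use it with the same directory fd it came from.
   Returns bytes filled in buf, 0 at end of directory, -1 on error. */
static long resume_scan(int dirfd, off_t cookie, char *buf, size_t len) {
    if (lseek(dirfd, cookie, SEEK_SET) == (off_t)-1)
        return -1;
    /* returns entries following the one that produced the cookie */
    return syscall(SYS_getdents64, dirfd, buf, len);
}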

By the way, the Linux man page also states quite clearly that you shouldn't be calling the getdents system call yourself: glibc doesn't even provide a wrapper for it. Use the POSIX readdir interface instead.

getdents() System Call

My best guess is that you've mixed up the legacy and "64" versions of the getdents syscall. Even on 64-bit systems there is a legacy version (without the 64) that writes a structure lacking the d_type member, so if you parse its output with the modern "64" version of the structure, the first character of each name gets misinterpreted as d_type. Use the (correct) getdents64 syscall instead.
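
For reference, here are the two record layouts as described in getdents(2). If you parse the legacy syscall's output with the 64-bit structure, the byte it takes for d_type actually holds the first character of the name:

#define _GNU_SOURCE
#include <sys/types.h>

struct linux_dirent {            /* legacy getdents */
    unsigned long  d_ino;
    unsigned long  d_off;
    unsigned short d_reclen;
    char           d_name[];     /* d_type is a single byte at the
                                    very end of the record, after
                                    the NUL-terminated name */
};

struct linux_dirent64 {          /* getdents64 */
    ino64_t        d_ino;
    off64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;       /* here, before the name */
    char           d_name[];
};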

What does lseek() mean for a directory file descriptor?

On my test system, if you opendir() and then readdir() through all the entries in the directory, telldir() returns the same value that lseek(fd, 0, SEEK_END) does:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    int fd = open(".", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t o = lseek(fd, 0, SEEK_END);
    if (o == (off_t)-1) {
        perror("lseek");
        return 1;
    }

    printf("Via lseek: %ld\n", (long)o);
    close(fd);

    DIR *d = opendir(".");
    if (!d) {
        perror("opendir");
        return 1;
    }
    while (readdir(d)) {
        /* just walk to the end of the directory stream */
    }

    printf("via telldir: %ld\n", telldir(d));
    closedir(d);

    return 0;
}

outputs

Via lseek: 9223372036854775807
via telldir: 9223372036854775807

Quoting from the telldir(3) man page:

In early filesystems, the value returned by telldir() was a simple file offset within a directory. Modern filesystems use tree or hash structures, rather than flat tables, to represent directories. On such filesystems, the value returned by telldir() (and used internally by readdir(3)) is a "cookie" that is used by the implementation to derive a position within a directory. Application programs should treat this strictly as an opaque value, making no assumptions about its contents.

It's a magic number indicating that the position in the directory stream is at the end. Don't count on the number always being the same, or on it being portable; it's a black box. Stick with the dirent API for traversing directory contents unless you really know exactly what you're doing. (Under the hood on Linux with glibc, opendir(3) calls openat(2) on the directory, readdir(3) fetches information about its contents with getdents(2), and seekdir(3) calls lseek(2), but those are just implementation details.)


