What Happens If There Are Too Many Files Under a Single Directory in Linux

What happens if there are too many files under a single directory in Linux?

ARG_MAX is going to take issue with that... for instance, rm -rf * (run from inside the directory) is going to fail with "Argument list too long". Utilities that do any kind of globbing (or the shell itself) will have some functionality break.

If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.

The effect on any given file system depends entirely on that file system: how frequently are these files accessed, and which file system is it? Remember, Linux (by default) prefers to keep recently accessed files cached in memory while pushing idle processes into swap, depending on your settings. Is this directory served over HTTP? Is Google going to see and crawl it? If so, you might need to adjust vfs_cache_pressure and swappiness.

Edit:

ARG_MAX is a system-wide limit on the size of the argument list that can be presented to a program's entry point. So, let's take 'rm' and the example "rm -rf *": the shell expands '*' into the list of matching filenames, which in turn becomes the argument list for 'rm'.

The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
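You can check the limit, and work around it with find, which builds argument lists that stay under it. A minimal sketch (the path and the 'foo*' pattern are just placeholders):

getconf ARG_MAX                                               # maximum combined size of argv + environment, in bytes
find . -maxdepth 1 -type f -name 'foo*' -exec rm -- '{}' +   # removes matching files in batches that fit under ARG_MAX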

I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.

How many files can I put in a directory?

FAT32:

  • Maximum number of files: 268,173,300
  • Maximum number of files per directory: 2^16 - 1 (65,535)
  • Maximum file size: 2 GiB - 1 without LFS, 4 GiB - 1 with

NTFS:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum file size

    • Implementation: 2^44 - 2^6 bytes (16 TiB - 64 KiB)
    • Theoretical: 2^64 - 2^6 bytes (16 EiB - 64 KiB)
  • Maximum volume size

    • Implementation: 2^32 - 1 clusters (256 TiB - 64 KiB)
    • Theoretical: 2^64 - 1 clusters (1 YiB - 64 KiB)

ext2:

  • Maximum number of files: 10^18
  • Maximum number of files per directory: ~1.3 × 10^20 (performance issues past 10,000)
  • Maximum file size

    • 16 GiB (block size of 1 KiB)
    • 256 GiB (block size of 2 KiB)
    • 2 TiB (block size of 4 KiB)
    • 2 TiB (block size of 8 KiB)
  • Maximum volume size

    • 4 TiB (block size of 1 KiB)
    • 8 TiB (block size of 2 KiB)
    • 16 TiB (block size of 4 KiB)
    • 32 TiB (block size of 8 KiB)

ext3:

  • Maximum number of files: min(volumeSize / 2^13, numberOfBlocks)
  • Maximum file size: same as ext2
  • Maximum volume size: same as ext2

ext4:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum number of files per directory: unlimited
  • Maximum file size: 2^44 - 1 bytes (16 TiB - 1)
  • Maximum volume size: 2^48 - 1 bytes (256 TiB - 1)

How many files in a directory is too many? (Downloading data from net)

Performance varies according to the filesystem you're using.

  • FAT: forget it :) (ok, I think the limit is 512 files per directory)
  • NTFS: Although it can hold about 4 billion files per folder, it degrades relatively quickly. Around a thousand files you will start to notice performance issues; at several thousand you'll see Explorer appear to hang for quite a while.
  • EXT3: the physical limit is 32,000 files, but performance suffers after several thousand files too.

  • EXT4: theoretically limitless

  • ReiserFS, XFS, JFS, BTRFS: these are the good ones for lots of files in a directory, as they're more modern and designed to handle many files (the others were designed back in the days when HDDs were measured in MB rather than GB). Performance with lots of files is a lot better (as it is with ext4), because they use a binary-search-style lookup to find the file you want, while the others use a more linear scan.

How many files in a directory is too many (on Windows and Linux)?

According to this Microsoft article, the lookup time for a directory increases in proportion to the square of the number of entries. (Although that was a bug against NT 3.5.)

A similar question was asked on the old Joel on Software forum. One answer was that performance seems to drop between 1,000 and 3,000 files, and one poster hit a hard limit at 18,000 files. Still another post claims that 300,000 files are possible, but search times rise rapidly once all the 8.3 filenames are used up.

To avoid large directories, create one, two, or more levels of subdirectories and hash the files into those. The simplest kind of hash uses the letters of the filename. So a file named abc0001.txt would be placed as a\b\c\abc0001.txt, assuming you chose three levels of nesting. Three is probably overkill; using two characters per directory reduces the number of nesting levels, e.g. ab\abc0001.txt. You will only need to go to a second level of nesting if you anticipate that any one directory will have vastly more than ca. 3,000 files.
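A minimal shell sketch of that scheme, using a two-character prefix (the *.txt pattern and the assumption of plain ASCII names at least two characters long are just for the example):

for f in *.txt; do            # e.g. abc0001.txt
  prefix=${f:0:2}             # "ab"
  mkdir -p "$prefix"          # create the bucket directory if needed
  mv -- "$f" "$prefix/"       # file ends up at ab/abc0001.txt
done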

How do you handle the Too many files problem when working in Bash?

In newer versions of findutils, find can do the work of xargs itself (including the argument-batching, or "glomming", behavior, so that only as many grep processes as needed are used):

find ../path -exec grep foo '{}' +

The use of + rather than ; as the last argument triggers this behavior.
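For comparison, a roughly equivalent pipeline using xargs (with -print0/-0 so that filenames containing whitespace survive) would be:

find ../path -type f -print0 | xargs -0 grep foo

Here ../path is simply the placeholder path from the example above.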

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It depends very much on the file system.

ext2 and ext3 have a hard limit of 32,000 files per directory. This is somewhat more than you are asking about, but close enough that I would not risk it. Also, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.

ext4 supposedly fixes these problems, but I cannot vouch for it personally.

XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.

So if you really need a huge number of files, I would use XFS or maybe ext4.

Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
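For instance, to list or count the entries of a huge directory without paying that sorting cost (the path is just an example):

ls -f /path/to/hugedir             # -f skips sorting (and implies -a)
ls -f /path/to/hugedir | wc -l     # rough entry count, including . and ..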

For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
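A minimal shell sketch of that naming scheme (the "storage" root and six-digit zero-padded IDs are assumptions made for the example):

id=1
padded=$(printf '%06d' "$id")                              # "000001"
path="storage/${padded:0:2}/${padded:2:2}/${padded:4:2}"   # "storage/00/00/01"
mkdir -p "$(dirname "$path")"                              # creates storage/00/00
touch "$path"                                              # the file itself is just "01"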

Will too many files storing in one folder make HTTP request for one of them slow?

No, the performance does not depend on the number of files that are in a directory. The reason why opening the folder in Windows explorer is slow is because it has to render icons and various other GUI related things for each file.

When the web server fetches a file, it doesn't need to do that. It just (more or less) directly goes to the location of the file on the disk.

EDIT: Millions is kind of pushing the limits of your file system (I assume NTFS in your case). It appears that anything over 10,000 files in a directory starts to degrade your performance. So not only from a performance standpoint, but from an organizational standpoint as well, you may want to consider separating them into subdirectories.

What is better for performance - many files in one directory, or many subdirectories each with one file?

It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly beyond a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you create one directory per image or put every image in a single directory, you will run into scaling problems. If you look at this answer:

How many files in a directory is too many (on Windows and Linux)?

You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.

Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.

Maximum number of files/directories on Linux?

ext[234] filesystems have a fixed maximum number of inodes; every file or directory requires one inode. You can see the current count and limits with df -i. For example, on a 15GB ext3 filesystem, created with the default settings:

Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda      1933312 134815 1798497    7% /

There's no limit on directories in particular beyond this; keep in mind that every file or directory requires at least one filesystem block (typically 4KB), though, even if it's a directory with only a single item in it.

As you can see, though, 80,000 inodes is unlikely to be a problem. And with the dir_index option (which can be enabled with tune2fs), lookups in large directories aren't much of a problem. However, note that many administrative tools (such as ls or rm) can still have a hard time dealing with directories that contain too many files. As such, it's recommended to split your files up so that you don't have more than a few hundred to a thousand items in any given directory. An easy way to do this is to hash whatever ID you're using and use the first few hex digits as intermediate directories.

For example, say you have item ID 12345 and it hashes to 'DEADBEEF02842.......'. You might store your files under /storage/root/d/e/12345. You've now cut the number of files in each directory to 1/256th of what it was.
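A minimal shell sketch of that idea (the storage root, the choice of md5sum, and hashing the bare item ID are all assumptions made for the example):

id=12345
hash=$(printf '%s' "$id" | md5sum | cut -c1-2)      # first two hex digits of the hash, e.g. "82"
dir="/storage/root/${hash:0:1}/${hash:1:1}"         # e.g. /storage/root/8/2
mkdir -p "$dir"                                     # files for item $id then live under "$dir/$id"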

Maximum number of files in one ext3 directory while still getting acceptable performance?

Provided you have a distro that supports the dir_index capability, you can easily have 200,000 files in a single directory. I'd keep it to about 25,000, though, just to be safe. Without dir_index, try to keep it to 5,000.
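To see whether dir_index is already enabled, and to turn it on if not, something along these lines should work (the device name is only a placeholder, and e2fsck -D must be run on an unmounted filesystem):

tune2fs -l /dev/sdXN | grep dir_index    # feature listed means it is already on
tune2fs -O dir_index /dev/sdXN           # enable hashed b-tree directory indexes
e2fsck -fD /dev/sdXN                     # reindex existing directories (filesystem unmounted)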


