Is It Ok (Performance-Wise) to Have Hundreds or Thousands of Files in the Same Linux Directory

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It depends very much on the file system.

ext2 and ext3 have a hard limit of 32,000 files per directory. This is somewhat more than you are asking about, but close enough that I would not risk it. Also, ext2 and ext3 will perform a linear scan every time you access a file by name in the directory.

ext4 supposedly fixes these problems, but I cannot vouch for it personally.

XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.

So if you really need a huge number of files, I would use XFS or maybe ext4.

Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...
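
The same idea can be applied in code: iterate over entries as a stream instead of collecting and sorting the whole listing. Below is a minimal Python sketch of this (not from the original answer; the directory path is just a placeholder):

    import os

    # Stream entries from a huge directory without building and sorting the
    # full listing, which is roughly the work that "ls -f" avoids.
    # "." is a placeholder; point it at the large directory in question.
    def iter_entries(path="."):
        with os.scandir(path) as it:   # reads directory entries lazily
            for entry in it:           # no sorting, no per-file stat required
                yield entry.name

    for name in iter_entries():
        pass                           # process each name as it arrives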

For the application you describe, I would probably create a hierarchy instead, since it costs hardly any extra code and little mental effort for someone looking at it later. Specifically, you can name your first file "00/00/01" instead of "000001".
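
As a minimal sketch of that naming scheme (not from the original answer; the root directory is a placeholder), a numeric ID can be mapped to a nested path in a few lines of Python:

    import os

    # Map a numeric ID such as 1 to a nested path like "data/00/00/01"
    # so that no single directory grows too large.
    def id_to_path(file_id, root="data"):
        s = f"{file_id:06d}"                               # "000001"
        return os.path.join(root, s[0:2], s[2:4], s[4:6])  # "data/00/00/01"

    path = id_to_path(1)
    os.makedirs(os.path.dirname(path), exist_ok=True)      # creates "data/00/00"
    # ...then write the file at `path`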

What is better for performance - many files in one directory, or many subdirectories each with one file?

It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3 and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly past a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you're creating one directory per image or putting every image in the root directory, you will run into scaling problems. If you look at this answer:

How many files in a directory is too many (on Windows and Linux)?

You'll see that ext3 runs into limits at about 32K entries in a directory, far fewer than you're proposing.

Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
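
A minimal Python sketch of that kind of sharding (not from the answer itself; the root directory, file name, and the fallback for single-digit IDs are assumptions):

    import os

    # Shard a numeric user ID into a two-level tree such as
    # "user-avatars/1/2/12345/original_filename.jpg".
    def avatar_path(user_id, filename, root="user-avatars"):
        digits = str(user_id)
        level1 = digits[0]                               # first digit, e.g. "1"
        level2 = digits[1] if len(digits) > 1 else "0"   # second digit, e.g. "2"
        return os.path.join(root, level1, level2, digits, filename)

    print(avatar_path(12345, "original_filename.jpg"))
    # -> user-avatars/1/2/12345/original_filename.jpg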

Storing & accessing up to 10 million files in Linux

You should definitely store the files in subdirectories.

EXT4 and XFS both use efficient lookup methods for file names, but if you ever need to run tools such as ls or find over the directories, you will be very glad to have the files in manageable chunks of 1,000 to 10,000 per directory.

The inode-number suggestion is about improving the sequential access performance of the EXT filesystems: the metadata is stored in inodes, and if you access those inodes out of order, the metadata reads become random. By reading your files in inode order you make the metadata access sequential too.
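
A minimal Python sketch of reading files in inode order (not from the original answer; the directory path is a placeholder):

    import os

    # Process files in inode-number order so that metadata (inode) reads are
    # roughly sequential on EXT filesystems.
    def files_in_inode_order(path="."):
        with os.scandir(path) as it:
            entries = [(e.inode(), e.path) for e in it if e.is_file()]
        entries.sort()                     # sort by inode number
        return [p for _, p in entries]

    for file_path in files_in_inode_order():
        with open(file_path, "rb") as f:
            f.read()                       # read contents in inode order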

How many files in a directory is too many? (Downloading data from net)

Performance varies according to the filesystem you're using.

  • FAT: forget it :) (OK, I think the limit is 512 files per directory)
  • NTFS: although it can hold 4 billion files per folder, it degrades relatively quickly. Around a thousand files you will start to notice performance issues, and at several thousand you'll see Explorer appear to hang for quite a while.
  • EXT3: the physical limit is 32,000 files, but performance suffers after several thousand files too.
  • EXT4: theoretically limitless
  • ReiserFS, XFS, JFS, BTRFS: these are the good ones for lots of files in a directory, as they are more modern and were designed to handle many files (the others date back to when HDDs were measured in MB, not GB). Performance is a lot better for lots of files (along with ext4) because they use a binary-search-style lookup to find the file you want, whereas the older filesystems use a linear scan.

How many files can I put in a directory?


FAT32:

  • Maximum number of files: 268,173,300
  • Maximum number of files per directory: 2^16 - 1 (65,535)
  • Maximum file size: 2 GiB - 1 without LFS, 4 GiB - 1 with

NTFS:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum file size

    • Implementation: 2^44 - 2^6 bytes (16 TiB - 64 KiB)
    • Theoretical: 2^64 - 2^6 bytes (16 EiB - 64 KiB)
  • Maximum volume size

    • Implementation: 2^32 - 1 clusters (256 TiB - 64 KiB)
    • Theoretical: 2^64 - 1 clusters (1 YiB - 64 KiB)

ext2:

  • Maximum number of files: 10^18
  • Maximum number of files per directory: ~1.3 × 10^20 (performance issues past 10,000)
  • Maximum file size

    • 16 GiB (block size of 1 KiB)
    • 256 GiB (block size of 2 KiB)
    • 2 TiB (block size of 4 KiB)
    • 2 TiB (block size of 8 KiB)
  • Maximum volume size

    • 4 TiB (block size of 1 KiB)
    • 8 TiB (block size of 2 KiB)
    • 16 TiB (block size of 4 KiB)
    • 32 TiB (block size of 8 KiB)

ext3:

  • Maximum number of files: min(volumeSize / 2^13, numberOfBlocks)
  • Maximum file size: same as ext2
  • Maximum volume size: same as ext2
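
As a quick sketch of the arithmetic in that formula (using a hypothetical 1 TiB volume with 4 KiB blocks; the 2^13 term corresponds to one inode per 8,192 bytes):

    # Hypothetical 1 TiB ext3 volume with 4 KiB blocks.
    volume_size = 2 ** 40                          # 1 TiB in bytes
    block_size = 4 * 1024                          # 4 KiB
    number_of_blocks = volume_size // block_size   # 268,435,456
    max_files = min(volume_size // 2 ** 13, number_of_blocks)
    print(max_files)                               # 134,217,728 (~134 million)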

ext4:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum number of files per directory: unlimited
  • Maximum file size: 2^44 - 1 bytes (16 TiB - 1)
  • Maximum volume size: 2^48 - 1 bytes (256 TiB - 1)

How many files in a directory is too many (on Windows and Linux)?

According to this Microsoft article, the lookup time in a directory increases in proportion to the square of the number of entries. (Although that was a bug report against NT 3.5.)

A similar question was asked on the old Joel on Software forum. One answer was that performance seems to drop between 1,000 and 3,000 files, and one poster hit a hard limit at 18,000 files. Still another post claims that 300,000 files are possible, but search times grow rapidly once all the 8.3 filenames are used up.

To avoid large directories, create one, two or more levels of subdirectories and hash the files into those. The simplest kind of hash uses the letters of the filename: a file named abc0001.txt would be placed at a\b\c\abc0001.txt, assuming you chose three levels of nesting. Three is probably overkill; using two characters per directory level reduces the number of nesting levels, e.g. ab\abc0001.txt. You would only need a second level of nesting if you anticipate that any one directory will end up with far more than roughly 3,000 files.
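
A minimal Python sketch of that hashing scheme (not from the original answer; the root directory is a placeholder):

    import os

    # Hash files into subdirectories using the first characters of the
    # file name, with a configurable number of levels.
    def hashed_path(filename, root="data", chars_per_level=2, levels=1):
        parts = [filename[i * chars_per_level:(i + 1) * chars_per_level]
                 for i in range(levels)]
        return os.path.join(root, *parts, filename)

    print(hashed_path("abc0001.txt"))                              # data/ab/abc0001.txt
    print(hashed_path("abc0001.txt", chars_per_level=1, levels=3)) # data/a/b/c/abc0001.txt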

200,000 images in single folder in linux, performance issue or not?

Ext3 uses a tree (an HTree index) to hold directory contents, so it handles a large number of files in a single directory better than file systems that use linear directory listings.

However, 200K files is still a huge number. It's reasonable to move them into subdirectories based on the first n characters of the file names. With this approach you only need to store the file names themselves, not the directory names: when you need to access a file, you can derive which subdirectory to look in from its name.
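
A minimal Python sketch of such a migration (not from the original answer; the source directory and prefix length are placeholders):

    import os
    import shutil

    # Move existing files into subdirectories named after the first two
    # characters of each file name.
    def shard_directory(src="images", prefix_len=2):
        with os.scandir(src) as it:
            names = [e.name for e in it if e.is_file()]
        for name in names:
            subdir = os.path.join(src, name[:prefix_len])
            os.makedirs(subdir, exist_ok=True)
            shutil.move(os.path.join(src, name), os.path.join(subdir, name))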

Does a large number of directories negatively impact performance?

1) The answer depends on

  • OS (e.g. Linux vs Windows) and

  • filesystem (e.g. ext3 vs NTFS).

2) Keep in mind that every new subdirectory you create uses up another inode

3) Linux usually handles "many files/directory" better than Windows

4) Some additional links (assuming you're on Linux):

  • 200,000 images in single folder in linux, performance issue or not?

  • https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory

  • https://serverfault.com/questions/147731/do-large-folder-sizes-slow-down-io-performance

  • http://tldp.org/LDP/intro-linux/html/sect_03_01.html


Does CoW work for files that have the same content but were not created by `cp --reflink`?


In this case, do they use the disk space of two files?

Yes: the files are created independently and use individual disk space.

If you already have many files with duplicate contents and want to take advantage of BTRFS' CoW features, you are looking for (offline) deduplication using tools like duperemove.


