Storing & Accessing Up to 10 Million Files in Linux

Storing & accessing up to 10 million files in Linux

You should definitely store the files in subdirectories.

EXT4 and XFS both use efficient lookup methods for file names, but if you ever need to run tools such as ls or find over those directories, you will be very glad to have the files in manageable chunks of 1,000 - 10,000 files.

The inode-number trick is about improving sequential access performance on the EXT filesystems. Metadata is stored in inodes, and if you access those inodes out of order, the metadata accesses are randomized. By reading your files in inode order, you make the metadata access sequential too.
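Something like this (a minimal Python sketch; the directory name is a placeholder) reads a directory in inode order: collect each entry's inode number, sort, then open the files in that order.

    import os

    def read_in_inode_order(directory):
        # DirEntry.inode() returns the inode number without an extra stat() call.
        entries = sorted((e.inode(), e.path)
                         for e in os.scandir(directory) if e.is_file())
        for _, path in entries:
            with open(path, "rb") as f:
                data = f.read()  # process the file contents here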

Is it OK (performance-wise) to have hundreds or thousands of files in the same Linux directory?

It depends very much on the file system.

ext2 and ext3 have a hard limit of 32,000 links per directory (which is what caps the number of subdirectories; plain files are not hard-limited the same way). This is somewhat more than you are asking about, but close enough that I would not risk it. Also, without the dir_index feature, ext2 and ext3 perform a linear scan every time you access a file by name in the directory.

ext4 supposedly fixes these problems, but I cannot vouch for it personally.

XFS was designed for this sort of thing from the beginning and will work well even if you put millions of files in the directory.

So if you really need a huge number of files, I would use XFS or maybe ext4.

Note that no file system will make "ls" run fast if you have an enormous number of files (unless you use "ls -f"), since "ls" will read the entire directory and then sort the names. A few tens of thousands is probably not a big deal, but a good design should scale beyond what you think you need at first glance...

For the application you describe, I would probably create a hierarchy instead, since it is hardly any additional coding or mental effort for someone looking at it. Specifically, you can name your first file "00/00/01" instead of "000001".
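A small helper along these lines (a Python sketch; the two-digit, two-level split just mirrors the "00/00/01" example) maps a sequential ID to that nested path:

    import os

    def id_to_path(n, base="data"):
        # 1 -> data/00/00/01, 123456 -> data/12/34/56
        s = "%06d" % n
        return os.path.join(base, s[0:2], s[2:4], s[4:6])

    path = id_to_path(1)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create data/00/00 on demand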

Using filesystem as database for 15M files - is it efficient?

There are a few reasons you probably want to look at a database (not necessarily MySQL) rather than the file system for this sort of thing:

More files in one directory slow things down

Although XFS is supposed to be very clever about allocating resources, most filesystems suffer degrading performance as the number of files in a single directory grows, and large directories also become a headache to deal with on the command line. The XFS datasheet (http://oss.sgi.com/projects/xfs/datasheet.pdf) includes a graph of lookup performance; it only goes up to 50k entries per directory, and it is already trending downward at that point.

Overhead

There is a certain amount of filesystem overhead per file: each one consumes an inode and typically at least one full block. If you have many small files, you may find that the final store bloats well beyond the size of the raw data as a result.
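You can get a feel for that overhead by comparing a file's logical size with the space it actually occupies; a rough Python sketch (the filename is a placeholder):

    import os

    st = os.stat("tiny.txt")                       # hypothetical small file
    print("content bytes:  ", st.st_size)
    print("allocated bytes:", st.st_blocks * 512)  # st_blocks is in 512-byte units on Linux

On a 4 KiB-block filesystem, even a 10-byte file will typically show 4,096 allocated bytes.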

Key cleaning

Are all your words safe to put in a filename? Are you sure? A slash or two in there is really going to ruin your day.

NoSQL might be a good option

Something like MongoDB/Redis might be a good option for this. MongoDB can store single documents of up to 16 MB and isn't much harder to use than putting things on the file system. If you are storing 15 MB documents, you might be getting a bit too close for comfort on that limit, but there are other options.

The nice thing about this is that lookup performance is likely to be pretty good out of the box, and if you later find it isn't, you can scale by creating a cluster, etc. Any system like this will also do a good job of managing the files on disk intelligently for good performance.
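As a rough sketch of the MongoDB route via PyMongo (the connection string and the database/collection names are made up; anything near the 16 MB document limit would want GridFS instead):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    words = client["wordstore"]["words"]   # hypothetical database and collection

    # Store the payload keyed by the word itself; bytes become BSON binary.
    words.insert_one({"_id": "azpdk", "content": b"...file contents..."})

    # Retrieval is a single lookup on the always-indexed _id field.
    doc = words.find_one({"_id": "azpdk"})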

If you are going to use the disk

Consider taking an MD5 hash of the word you want to store, and base your filename on this. For example, the MD5 of azpdk is:

1c58fb66d5a4d6a1ebe5ec9e217fbbf9

You could use this to create a filename e.g.:

my_directory/1c5/8fb/66d5a4d6a1ebe5ec9e217fbbf9

This has a few nice features:

  • The hash takes care of scary characters
  • The directories spread out the data, so no directory has more than 4096 entries
  • This means the lookup performance should be relatively decent
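Put together, the scheme might look like this Python sketch (the 3/3/26 character split mirrors the example path above; my_directory and the payload are placeholders):

    import hashlib
    import os

    def word_to_path(word, base="my_directory"):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()  # 32 hex characters
        # Split into 3 + 3 + 26 characters, matching the example path above.
        return os.path.join(base, digest[:3], digest[3:6], digest[6:])

    path = word_to_path("azpdk")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(b"...value for azpdk...")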

Hope that helps.

How many files can I put in a directory?

FAT32:

  • Maximum number of files: 268,173,300
  • Maximum number of files per directory: 2^16 - 1 (65,535)
  • Maximum file size: 2 GiB - 1 without LFS, 4 GiB - 1 with

NTFS:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum file size

    • Implementation: 2^44 - 2^6 bytes (16 TiB - 64 KiB)
    • Theoretical: 2^64 - 2^6 bytes (16 EiB - 64 KiB)
  • Maximum volume size

    • Implementation: 2^32 - 1 clusters (256 TiB - 64 KiB)
    • Theoretical: 2^64 - 1 clusters (1 YiB - 64 KiB)

ext2:

  • Maximum number of files: 10^18
  • Maximum number of files per directory: ~1.3 × 10^20 (performance issues past 10,000)
  • Maximum file size

    • 16 GiB (block size of 1 KiB)
    • 256 GiB (block size of 2 KiB)
    • 2 TiB (block size of 4 KiB)
    • 2 TiB (block size of 8 KiB)
  • Maximum volume size

    • 4 TiB (block size of 1 KiB)
    • 8 TiB (block size of 2 KiB)
    • 16 TiB (block size of 4 KiB)
    • 32 TiB (block size of 8 KiB)

ext3:

  • Maximum number of files: min(volumeSize / 2^13, numberOfBlocks)
  • Maximum file size: same as ext2
  • Maximum volume size: same as ext2

ext4:

  • Maximum number of files: 2^32 - 1 (4,294,967,295)
  • Maximum number of files per directory: unlimited
  • Maximum file size: 2^44 - 1 bytes (16 TiB - 1)
  • Maximum volume size: 2^48 - 1 bytes (256 TiB - 1)

single folder or many folders for storing 8 million images of hundreds of stores?

It depends on the file system you are using (ext3, NTFS, FAT, etc.).

Each will have different folder size limits and performance characteristics.

For 8 million files, it will be safest to split them across as many folders as you can. Then you have the option of putting folders on different physical drives if you run into scaling issues.

If you're on Linux, it's easy to mount another drive right inside your existing folder structure. Alternatively, you could use a symbolic link.

images/Store1/FILES...
images/Store2 ---> /mount/SDA01/Store2/ (symbolic link to a separate drive)
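Scripted, the link is a single call; a Python sketch using the example paths above:

    import os

    # images/Store2 becomes a link to a directory living on a separate drive.
    os.symlink("/mount/SDA01/Store2", "images/Store2")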

Limits

See this SuperUser question for more detail about different File System limits: https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32

Note these are the absolute limits of what the system can handle. Performance at the upper bound of those limits will definitely suffer.

what is the max files per directory in EXT4?

It depends on the mkfs parameters used when the filesystem was created. Different Linux flavors ship different defaults, so it's really impossible to answer definitively.
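In practice, the binding limit is usually the total number of inodes the filesystem was created with, not a per-directory cap. A hedged Python sketch for checking it (the path is a placeholder for any file on the filesystem of interest):

    import os

    vfs = os.statvfs("/srv/files")       # hypothetical mount point
    print("total inodes:", vfs.f_files)
    print("free inodes: ", vfs.f_ffree)  # once this hits 0, no new files can be created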

Maximum number of files in one ext3 directory while still getting acceptable performance?

Provided you have a distro that supports the dir_index capability, you can easily have 200,000 files in a single directory. I'd keep it to about 25,000, though, just to be safe. Without dir_index, try to keep it to 5,000.
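If you're unsure whether dir_index is enabled, something like this Python sketch can check (it assumes the filesystem lives on /dev/sda1 and that tune2fs is available, typically run as root):

    import subprocess

    # tune2fs -l prints a "Filesystem features:" line listing enabled features.
    out = subprocess.run(["tune2fs", "-l", "/dev/sda1"],   # hypothetical device
                         capture_output=True, text=True, check=True).stdout
    features = next(line for line in out.splitlines()
                    if line.startswith("Filesystem features:"))
    print("dir_index enabled:", "dir_index" in features)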


