200,000 Images in Single Folder in Linux, Performance Issue or Not

200,000 images in single folder in linux, performance issue or not?

Ext3 uses a tree (the HTree directory index) to hold directory contents, so its ability to handle a large number of files in a single directory is better than that of file systems with linear directory listings.
Here you can read the description of the tree used to keep directory contents.

However, 200K files is still a huge number. It's reasonable to move them into subdirectories based on the first n characters of the file names. With this approach you only need to store the file name itself, and when you need to access a file you can derive from its name which subdirectory to look in (a small sketch of this lookup follows).
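A minimal bash sketch of that lookup, assuming a two-character shard and a placeholder base path "images" (neither is prescribed by the answer above):

# Derive the shard directory from the first two characters of the file name,
# e.g. abc123.jpg -> images/ab/abc123.jpg
name="abc123.jpg"
subdir="images/${name:0:2}"
mkdir -p "$subdir"        # create the shard directory if it doesn't exist yet
mv -- "$name" "$subdir/"  # or read "$subdir/$name" directly when serving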

Does a large number of directories negatively impact performance?

1) The answer depends on

  • OS (e.g. Linux vs Windows) and

  • filesystem (e.g. ext3 vs NTFS).

2) Keep in mind that when you arbitrarily create a new subdirectory, you're using more inodes (each directory consumes one; a quick way to check inode usage is sketched after the link list below)

3) Linux usually handles "many files/directory" better than Windows

4) A couple of additional links (assuming you're on Linux):

  • 200,000 images in single folder in linux, performance issue or not?

  • https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory

  • https://serverfault.com/questions/147731/do-large-folder-sizes-slow-down-io-performance

  • http://tldp.org/LDP/intro-linux/html/sect_03_01.html

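As a hedged sketch for point 2) above (GNU coreutils; /var/www/images is a placeholder path), you can compare inode usage against block usage to see whether lots of small files and subdirectories are eating into the filesystem's inode budget:

df -i /var/www/images    # inode totals, used and free
df -h /var/www/images    # block usage, for comparison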

How many files in a directory is too many? (Downloading data from net)

Performance varies according to the filesystem you're using; a quick way to check which filesystem a given path lives on is sketched after this list.

  • FAT: forget it :) (ok, I think the limit is 512 files per directory)
  • NTFS: Although it can hold about 4 billion files per volume, it degrades relatively quickly: around a thousand files in one folder you will start to notice performance issues, and with several thousand you'll see Explorer appear to hang for quite a while.
  • EXT3: the hard limit is roughly 32,000 subdirectories per directory (plain files can go far beyond that), but performance suffers after several thousand entries too.

  • EXT4: theoretically limitless

  • ReiserFS, XFS, JFS, BTRFS: these are the good ones for lots of files in a directory, as they're more modern and designed to handle many files (the others were designed back in the days when HDDs were measured in MB, not GB). Performance is a lot better for lots of files (along with ext4) because they use hashed/tree-based directory indexes to look up the file you want (the others use a more linear scan).
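As promised above, a small sketch (GNU coreutils/findutils; /var/www/images is a placeholder path) for checking which filesystem a directory lives on and roughly how many entries it holds:

df -T /var/www/images                       # shows the filesystem type (ext4, xfs, ...)
find /var/www/images -maxdepth 1 | wc -l    # rough entry count (includes the directory itself)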

How can I store 1 billion images on servers uploaded from a web application?

For the storage part of the project, I would say that you need something different from a usual file system mounted on dedicated or external disks (SATA, SAS or fiber/SSD).

The GlusterFS distributed file system would be ideal for use as a storage engine, because it supports replicated configurations (for HA) as well as distributed (and mixed) configurations to gain IO speed.

For the organization part of the project, I would have one main file system (mounted across all clients/web servers), and in this file system a separate directory for every user, with two subdirectories (one for the high-resolution and one for the low-resolution pictures), roughly as sketched below.
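A minimal sketch of that per-user layout (the mount point, user id and subdirectory names are placeholders, not part of the original answer):

USER_ID=12345
mkdir -p "/mnt/images/${USER_ID}/highres" "/mnt/images/${USER_ID}/lowres"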

Finally, the same storage servers can double as web servers, or you can use different servers (possibly virtual machines: Xen, KVM or VMware). The gluster volume should be mounted on the web servers with the FUSE-based glusterfs client (e.g. from /etc/fstab); this is a must for the features of GlusterFS to work. A minimal mount example follows.
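A hedged sketch of that mount, assuming a volume named gv0 exported by a server called storage1 and a mount point of /mnt/images (all placeholders):

# One-off mount using the glusterfs FUSE client:
mount -t glusterfs storage1:/gv0 /mnt/images

# Equivalent /etc/fstab entry so the volume comes back after a reboot:
#   storage1:/gv0  /mnt/images  glusterfs  defaults,_netdev  0 0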

Millions of small graphics files and how to overcome slow file system access on XP

There are several things you could/should do

  • Disable automatic NTFS short (8.3) file name generation (via the NtfsDisable8dot3NameCreation registry value, or "fsutil behavior set disable8dot3 1")
  • Or restrict file names to the 8.3 pattern (e.g. i0000001.jpg, ...)

  • In any case try making the first six characters of the filename as unique/different as possible

  • If you use the same folder over and over (say adding files, removing files, re-adding files, ...)

    • Use contig (the Sysinternals tool) to keep the directory's index file as unfragmented as possible
    • Especially when removing many files, consider the folder remove trick to reduce the directory index file size
  • As already posted, consider splitting up the files into multiple directories (a small move script is sketched after the example below).

e.g. instead of

directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg

use

directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
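A rough bash sketch of that migration (so Linux or Cygwin rather than plain XP; the path "directory" and the .jpg glob come from the example above, everything else is an assumption):

cd directory
for f in *.jpg; do
    sub="${f:1:1}/${f:2:1}"   # e.g. abc.jpg -> b/c
    mkdir -p "$sub"
    mv -- "$f" "$sub/"
done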

Maximum number of files in one ext3 directory while still getting acceptable performance?

Provided you have a distro that supports the dir_index capability, you can easily have 200,000 files in a single directory. I'd keep it at about 25,000 though, just to be safe. Without dir_index, try to keep it at 5,000. (You can check whether dir_index is enabled as sketched below.)
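A hedged sketch for checking and enabling dir_index (requires e2fsprogs and root; /dev/sdXN is a placeholder device):

tune2fs -l /dev/sdXN | grep dir_index    # is dir_index listed among the filesystem features?

# To enable it (existing directories then need their indexes rebuilt,
# which means running e2fsck on the unmounted filesystem):
#   tune2fs -O dir_index /dev/sdXN
#   e2fsck -fD /dev/sdXN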

How much does loading images or saving images to the server affect the server load?

You could try building a CDN (Content Delivery Network) of sorts.
Point uploads to a server (or servers) dedicated to upload processing.

Process the files and replicate them to a cluster of file servers (the CDN); a rough replication sketch follows below.

Then return a link to a CDN server that only serves back the file.
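A rough sketch of that replication step using rsync over SSH (the host names and paths are placeholders, not part of the original answer):

for host in cdn1 cdn2 cdn3; do
    rsync -a /srv/processed/ "${host}:/srv/static/" &   # push processed files to each content server
done
wait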

Bonus Scenario:

Having your content served from a different server than the one that processes the data allows you to take the processing server down and still have the content server running.

What is better for performance - many files in one directory, or many subdirectories each with one file?

It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and that you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly over a certain number of entries in a single directory, regardless of whether those entries are directories or files. So no matter whether you're creating one directory per image or putting every image in the root directory, you will run into scaling problems. If you look at this answer:

How many files in a directory is too many (on Windows and Linux)?

You'll see that ext3 runs into a hard limit at about 32K subdirectories in a directory (and practical slowdowns well before that), far fewer than you're proposing.

Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg (or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs; see the sketch below). Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
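A minimal bash sketch of that sharding scheme (the base path, id and file name are placeholders; assumes numeric IDs of at least two digits):

id=12345
file=original_filename.jpg
shard="/user-avatars/${id:0:1}/${id:1:1}/${id}"   # -> /user-avatars/1/2/12345
mkdir -p "$shard"
cp -- "$file" "$shard/"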

How to gather disk usage on a storage system faster than just using du?

There is no magic. In order to gather the disk usage, you'll have to traverse the file system. If you are looking for a method of just doing it at the file system level, that would be easy (just df -h, for example)... but it sounds like you want it at a directory level within a mount point.

You could perhaps run jobs in parallel on each directory. For example in bash:

# Loop over the top-level directories with a glob; quoting "$D" (and not
# parsing ls output) avoids breaking on names with spaces.
for D in */
do
  du -s "$D" &
done

wait

But you are likely to be I/O bound, I think. Also, if you have a lot of top-level directories, this method might be... well... rather taxing, since it doesn't put any kind of cap on the maximum number of processes.

If you have GNU Parallel installed you can do something like:

ls -d */ | parallel du -s 

...which would be much better. parallel has a lot of nice features like grouping the output, governing the max number of processes, etc., and you can also pass in some parameters to tweak it (although, like I mentioned earlier, you'll be I/O bound, so more processes is not better; in fact, fewer than the default may be preferable).
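For instance, a small sketch of capping the concurrency with GNU Parallel's -j option (the value 4 is an arbitrary example):

ls -d */ | parallel -j 4 du -s {}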

The only other thought I have on this is to perhaps use disk quotas if that is really the point of what you are trying to do. There is a good tutorial here if you want to read about it.


