It Is Normal for Linux to Read a Block Larger Than 4 KB (Or Several 4 KB Blocks) at a Time

Block-based storage

  1. In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything smaller than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are traditionally 512 bytes, or 4096 bytes on newer drives; pages are usually 4096 bytes).

    One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.

  2. It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).

  3. read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, not otherwise. (A small sketch comparing the two follows after this list.)

    The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.

  4. Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
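
To make item 3 above concrete, here is a minimal sketch (my own illustration, not part of the original answer) that reads a file once with read() and once via mmap(), then checks that both see the same bytes; the fallback file name is an arbitrary placeholder:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        const char *path = argc > 1 ? argv[1] : "data.bin";  /* arbitrary example file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

        /* Variant 1: read() copies data from the page cache into our buffer. */
        char *buf = malloc(st.st_size);
        if (!buf) { perror("malloc"); return 1; }
        ssize_t got = 0, n;
        while (got < st.st_size && (n = read(fd, buf + got, st.st_size - got)) > 0)
            got += n;

        /* Variant 2: mmap() maps the same page-cache pages into our address
         * space; pages are faulted in (and may block) on first access. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        printf("read() and mmap() contents %s\n",
               memcmp(buf, map, st.st_size) == 0 ? "match" : "differ");

        munmap(map, st.st_size);
        free(buf);
        close(fd);
        return 0;
    }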

How to disable request merges in Block device driver?

First, check the current nomerges value:

              cat /sys/block/sda/queue/nomerges

If it's not already 2 (0 enables all merging, 1 allows only simple one-hit merges, and 2 disables merging entirely), then do:

              echo 2 > /sys/block/sda/queue/nomerges

named pipe blocks on writing before its maximum size

It's because of the way 4 KB pages are filled with written data in the pipe implementation of the Linux kernel. More specifically, the kernel appends written data to a page only if the data fits entirely into that page; otherwise it puts the data into another page with enough free bytes.

If you write 3 bytes at a time, the pipe pages won't be filled at their full capacity, because the page size (4096) is not a multiple of 3: the nearest multiple is 4095, so each page will end up with 1 "wasted" byte. Multiplying 4095 by 16, which is the total number of pages, you get 65520.

In your second use case, when you write 65520 bytes all at once, you are filling 15 pages entirely (61440 bytes), plus you are putting the remaining 4080 bytes in the last page, which will have 16 bytes still available for subsequent writes: that's why your second write() call with 3 bytes succeeds without blocking.

For full details on the Linux pipe implementation, see https://elixir.bootlin.com/linux/latest/source/fs/pipe.c.
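
One way to observe this behaviour (a sketch assuming the default 16-page pipe capacity, not code from the original answer) is to make the write end non-blocking and count how many 3-byte writes are accepted before the pipe reports it is full:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        /* Non-blocking writes let us see where a blocking write would stop. */
        fcntl(fds[1], F_SETFL, O_NONBLOCK);

        size_t total = 0;
        for (;;) {
            ssize_t n = write(fds[1], "abc", 3);
            if (n < 0) {
                if (errno == EAGAIN)   /* no room left for another 3-byte write */
                    break;
                perror("write");
                return 1;
            }
            total += (size_t)n;
        }
        /* With the default 16-page (64 KiB) pipe, this typically prints 65520,
         * i.e. 16 pages x 4095 bytes, not 65536. */
        printf("3-byte writes accepted: %zu bytes\n", total);

        close(fds[0]);
        close(fds[1]);
        return 0;
    }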

How to determine reasonable number of bytes to read per read system call?

I've pondered basically the same question, and I've come to a very simple conclusion:

Use a conservative default or heuristic, but let the user override it easily if they want.

You see, in some cases the user might not want maximum throughput from your utility, but would rather have it do whatever it does in the background. Perhaps the task is just not that important. Personally, on Linux, I often use the nice and ionice utilities to put long-but-low-priority tasks on the back burner, so to speak, so that they don't interfere with my actual work.

Benchmarks within the last decade indicate that block sizes of 128 kB to 2 MB (2^17 to 2^21 bytes) consistently work well, not far from optimal rates in almost all situations, with the average slowly shifting towards the larger end of that range. Typically, power-of-two sizes seem to work better than non-powers-of-two, although I haven't seen enough benchmarks of various RAID configurations to trust that fully.

Because your utility will almost certainly be recompiled for each new hardware type/generation, I'd prefer to have a default block size, defined at compile time, but have it trivially overridden at run time (via a command-line option, environment variable, and/or configuration file).

If your utility is packaged for current POSIXy OSes, the binaries could use a default that seems to suit best for the types of tasks done on that machine; for example, Raspberry Pis and other SBCs often don't have that much memory to start with, so a smaller (say, 65536 bytes) default block size might work best. Desktop users might not care about memory hogs, so you might use a much larger default block size on current desktop machines.

(On servers, and in high-performance computing (which is where I've pondered this), the block size is basically either benchmarked on the exact hardware and workload, or it is just a barely-informed guess. Typically the latter.)

Alternatively, you could construct a heuristic based on the st_blksize values of the files involved, perhaps multiplied by a default factor and clamped to some preferred range. However, such heuristics tend to bit-rot fast, as hardware changes.

With heuristics, it is important to remember that the idea is not to always achieve the optimum, but to avoid really poor results. If a user wants to squeeze out the last few percent of performance, they can do some benchmarking within their own workflow, and tune the defaults accordingly. (I personally have, and do.)
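
As a rough sketch of the default-plus-override approach described above (the macro names, the MYTOOL_BLOCK_SIZE environment variable, the multiplier, and the clamping bounds are all arbitrary illustrative choices, not anything prescribed above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    /* Compile-time default; could be overridden with -DDEFAULT_BLOCK_SIZE=... */
    #ifndef DEFAULT_BLOCK_SIZE
    #define DEFAULT_BLOCK_SIZE (2UL << 20)   /* 2 MiB */
    #endif

    #define MIN_BLOCK_SIZE (64UL << 10)      /* 64 KiB */
    #define MAX_BLOCK_SIZE (8UL << 20)       /* 8 MiB */

    /* Pick an I/O block size: explicit environment override first, then a
     * heuristic based on the file's st_blksize, clamped to a sane range. */
    static size_t choose_block_size(const char *path)
    {
        const char *env = getenv("MYTOOL_BLOCK_SIZE");   /* hypothetical variable */
        if (env) {
            size_t v = (size_t)strtoull(env, NULL, 0);
            if (v) return v;                 /* trust an explicit user override */
        }

        size_t bs = DEFAULT_BLOCK_SIZE;
        struct stat st;
        if (path && stat(path, &st) == 0 && st.st_blksize > 0)
            bs = (size_t)st.st_blksize * 256;  /* arbitrary example multiplier */

        if (bs < MIN_BLOCK_SIZE) bs = MIN_BLOCK_SIZE;
        if (bs > MAX_BLOCK_SIZE) bs = MAX_BLOCK_SIZE;
        return bs;
    }

    int main(int argc, char *argv[])
    {
        printf("using %zu-byte blocks\n",
               choose_block_size(argc > 1 ? argv[1] : NULL));
        return 0;
    }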

Determine the size of a block device

fdisk doesn't understand the partition layout used by my Mac running Linux, nor any other non-PC partition format. (Yes, there's mac-fdisk for old Mac partition tables, and gdisk for newer GPT partition table, but those aren't the only other partition layouts out there.)

Since the kernel already scanned the partition layouts when the block device came into service, why not ask it directly?


$ cat /proc/partitions
major minor  #blocks  name

   8      16  390711384 sdb
   8      17     514079 sdb1
   8      18  390194752 sdb2
   8      32  976762584 sdc
   8      33     514079 sdc1
   8      34  976245952 sdc2
   8       0  156290904 sda
   8       1     514079 sda1
   8       2  155774272 sda2
   8      48 1465138584 sdd
   8      49     514079 sdd1
   8      50 1464621952 sdd2
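
For reference, the #blocks column is in 1 KiB units. If you want the size of a block device programmatically instead of parsing /proc/partitions, a small sketch like this can ask the kernel via the BLKGETSIZE64 ioctl (the default device path is just an example):

    #include <fcntl.h>
    #include <linux/fs.h>     /* BLKGETSIZE64 */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";   /* example device */
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t bytes = 0;
        if (ioctl(fd, BLKGETSIZE64, &bytes) < 0) {   /* device size in bytes */
            perror("ioctl(BLKGETSIZE64)");
            return 1;
        }
        printf("%s: %llu bytes (%llu one-KiB blocks, as in /proc/partitions)\n",
               dev, (unsigned long long)bytes,
               (unsigned long long)(bytes / 1024));

        close(fd);
        return 0;
    }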

Historical perspective on Linux filesystems

As usual for Wikipedia pages, Block (data storage) is informative despite being far too exuberant about linking all keywords.

In computing (specifically data transmission and data storage), a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured is said to be blocked. The process of putting data into blocks is called blocking. Blocking is used to facilitate the handling of the data-stream by the computer program receiving the data. Blocked data is normally read a whole block at a time. Blocking is almost universally employed when storing data to 9-track magnetic tape, to rotating media such as floppy disks, hard disks, optical discs and to NAND flash memory.

Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data, though the block size in file systems may be a multiple of the physical block size. In classical file systems, a single block may only contain a part of a single file. This leads to space inefficiency due to internal fragmentation, since file lengths are often not multiples of block size, and thus the last block of files will remain partially empty. This will create slack space, which averages half a block per file. Some newer file systems attempt to solve this through techniques called block suballocation and tail merging.

There's also a reasonable overview of the classical Unix File System.

Traditionally, hard disk geometry (the layout of blocks on the disk itself) has been CHS.

  • Head: the magnetic reader/writer on each (side of a) platter; can move in and out to access different cylinders
  • Cylinder: the ring of tracks (one per head) at a given arm position; each track is the circle that passes under a head as the platter rotates
  • Sector: a constant-sized amount of data stored contiguously on a portion of a track; the smallest unit of data that the drive can deal with

CHS isn't used much these days, as

  • Hard disks no longer use a constant number of sectors per cylinder. More data is squeezed onto a platter by using a constant arclength per sector rather than a constant rotational angle, so there are more sectors on the outer cylinders than there are on the inner cylinders.
  • By the ATA specification, a drive may have no more than 2^16 cylinders, 2^4 heads, and 2^8 sectors per track; with 512 B sectors, this works out to 2^37 bytes, a limit of 128 GiB. Through BIOS INT 13h, it is not possible to access anything beyond 7.88 GiB through CHS anyway.
  • For backwards compatibility, larger drives still claim to have a CHS geometry (otherwise DOS wouldn't be able to boot), but getting at any data beyond the CHS limit requires LBA addressing.
  • CHS doesn't even make sense on RAID or non-rotational media.

but for historical reasons, this has affected block sizes: because sector sizes were almost always 512B, filesystem block sizes have always been multiples of 512B. (There is a movement afoot to introduce drives with 1kB and 4kB sector sizes, but compatibility looks rather painful.)

Generally speaking, smaller filesystem block sizes result in less wasted space when storing many small files (unless advanced techniques like tail merging are in use), while larger block sizes reduce external fragmentation and have lower overhead on large disks. The filesystem block size is usually a power of 2, is limited below by the block device's sector size, and is often limited above by the OS's page size.

The page size varies by OS and platform (and, in the case of Linux, can vary by configuration as well). As with block sizes, smaller page sizes reduce internal fragmentation but require more administrative overhead. A 4 kB page size is common on 32-bit platforms.
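
To see both quantities on a given system, a short sketch like the following prints the kernel page size next to the preferred I/O block size that stat() reports for a file (the default path is arbitrary):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        const char *path = argc > 1 ? argv[1] : ".";   /* any file or directory */

        /* Kernel/MMU page size (typically 4096 on x86). */
        long page = sysconf(_SC_PAGESIZE);

        /* Filesystem's preferred I/O block size for this file. */
        struct stat st;
        if (stat(path, &st) < 0) { perror("stat"); return 1; }

        printf("page size:            %ld bytes\n", page);
        printf("st_blksize for %s:  %ld bytes\n", path, (long)st.st_blksize);
        return 0;
    }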

Now, on to indirect blocks. In the UFS design,

  • An inode describes a file.
  • The number of pointers to data blocks that an inode can hold is very limited (fewer than 16); the specific number varies between derived implementations.
  • For small files, the pointers can directly point to the data blocks that compose a file.
  • For larger files, there must be indirect pointers, which point to a block that contains only more pointers to blocks. These may be direct pointers to data blocks belonging to the file, or, if the file is very large, they may be yet more indirect pointers (double or triple indirection).

Thus the amount of storage required for a file may be greater than just the blocks containing its data, when indirect pointers are in use.
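
As a rough worked example (the parameters are ext2-like assumptions chosen for illustration: 4 KiB blocks, 4-byte block pointers, 12 direct pointers per inode; the text above does not fix these numbers), this sketch counts the indirect blocks a file of a given size would need:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE     4096ULL          /* assumed filesystem block size */
    #define PTR_SIZE       4ULL             /* assumed size of a block pointer */
    #define DIRECT_PTRS    12ULL            /* assumed direct pointers per inode */
    #define PTRS_PER_BLOCK (BLOCK_SIZE / PTR_SIZE)   /* 1024 pointers per indirect block */

    /* Count the extra (indirect) blocks needed on top of the data blocks. */
    static uint64_t indirect_blocks(uint64_t file_size)
    {
        uint64_t data = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
        if (data <= DIRECT_PTRS)
            return 0;                              /* direct pointers suffice */
        data -= DIRECT_PTRS;

        uint64_t meta = 1;                         /* single indirect block */
        if (data <= PTRS_PER_BLOCK)
            return meta;
        data -= PTRS_PER_BLOCK;

        /* Double indirection: one second-level block per PTRS_PER_BLOCK data
         * blocks, plus the double-indirect block itself. Triple indirection is
         * omitted; with these parameters it only matters beyond ~4 GiB. */
        meta += 1 + (data + PTRS_PER_BLOCK - 1) / PTRS_PER_BLOCK;
        return meta;
    }

    int main(void)
    {
        uint64_t sizes[] = { 16ULL << 10, 1ULL << 20, 1ULL << 30 };  /* 16 KiB, 1 MiB, 1 GiB */
        for (int i = 0; i < 3; i++)
            printf("%10llu bytes -> %llu indirect block(s)\n",
                   (unsigned long long)sizes[i],
                   (unsigned long long)indirect_blocks(sizes[i]));
        return 0;
    }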

Not all filesystems use this method for keeping track of the data blocks belonging to a file. FAT simply uses a single file allocation table, which is effectively a gigantic series of linked lists, and many modern filesystems use extents instead.


