dd: How to Calculate Optimal Blocksize

dd: How to calculate optimal blocksize?

The optimal block size depends on various factors, including the operating system (and its version), and the various hardware buses and disks involved. Several Unix-like systems (including Linux and at least some flavors of BSD) define the st_blksize member in the struct stat that gives what the kernel thinks is the optimal block size:

#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
    struct stat stats;

    /* Ask the kernel about "/" and print its preferred I/O block size. */
    if (!stat("/", &stats))
    {
        printf("%lu\n", (unsigned long) stats.st_blksize);
    }

    return 0;
}

The best way may be to experiment: copy a gigabyte with various block sizes and time that. (Remember to clear kernel buffer caches before each run: echo 3 > /proc/sys/vm/drop_caches).
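If it helps, a benchmark loop along these lines is one way to run that experiment (a rough sketch; /dev/sdX is a placeholder for the disk or large file you want to read, and dropping caches requires root):

#!/bin/sh
# Rough read benchmark: copy ~1 GiB at several block sizes and compare
# the throughput dd reports on its last output line. SRC is a placeholder;
# point it at your disk or at a large test file.
SRC=/dev/sdX

for spec in "4k 262144" "64k 16384" "1M 1024" "4M 256"; do
    set -- $spec
    bs=$1; count=$2                        # each pair totals 1 GiB
    sync
    echo 3 > /proc/sys/vm/drop_caches      # start each run with a cold cache
    echo "block size: $bs"
    dd if="$SRC" of=/dev/null bs="$bs" count="$count" 2>&1 | tail -n 1
done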

However, as a rule of thumb, I've found that a large enough block size lets dd do a good job, and the differences between, say, 64 KiB and 1 MiB are minor, compared to 4 KiB versus 64 KiB. (Though, admittedly, it's been a while since I did that. I use a mebibyte by default now, or just let dd pick the size.)

Highest block size for dd command

You could go higher, but it probably won't make any difference. If you go too high, things might actually slow down.

Different SSD devices have different performance profiles. There is no universal, ultimate answer that's right for every SSD device that exists in this entire world.

The only way to get the right answer is to experiment with various block sizes, and benchmark the performance.
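For writes on an SSD, one way to keep the page cache from masking the results is GNU dd's oflag=direct on Linux (a sketch; /mnt/ssd/testfile is a placeholder path):

# Write benchmark sketch using direct I/O so the page cache doesn't hide
# the device's behaviour. Each pair transfers 256 MiB in total.
for spec in "4k 65536" "64k 4096" "1M 256"; do
    set -- $spec
    echo "block size: $1"
    dd if=/dev/zero of=/mnt/ssd/testfile bs="$1" count="$2" oflag=direct 2>&1 | tail -n 1
    rm -f /mnt/ssd/testfile
done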

How does stat command calculate the blocks of a file?

The stat command-line tool uses the stat / fstat etc. functions, which return data in the stat structure. The st_blocks member of the stat structure returns:

The total number of physical blocks of size 512 bytes actually allocated on disk. This field is not defined for block special or character special files.

So for your "Email" example, with a size of 965 and a block count of 8, it is indicating that 8*512=4096 bytes are physically allocated on disk. The reason it's not 2 is that the file system on disk does not allocate space in units of 512, it evidently allocates them in units of 4096. (And the unit of allocation may vary depending on file size and filesystem sophistication. E.g. ZFS supports different units of allocation.)
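With GNU coreutils stat you can pull those fields out and do the multiplication yourself; the output shown is illustrative, using the numbers quoted in the question for the "Email" file:

$ stat -c 'size=%s blocks=%b block-unit=%B io-block=%o' Email
size=965 blocks=8 block-unit=512 io-block=4096
$ echo $((8 * 512))    # bytes physically allocated on disk
4096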

Similarly, for the wxPython example, it indicates that 7056*512 bytes, or 3612672 bytes are physically allocated on disk. You get the idea.

The IO block size is "a hint as to the 'best' unit size for I/O operations" - it's usually the unit of allocation on the physical disk. Don't get confused between the IO block and the block that stat uses to indicate physical size; the blocks for physical size are always 512 bytes.

Update based on comment:

Like I said, st_blocks is how the OS indicates how much space the file uses on disk. The actual unit of allocation on disk is the file system's choice. For example, ZFS can use allocation blocks of variable size, even within the same file, because of the way it allocates space: a file starts out with a small block size, and the block size keeps increasing until it reaches a particular point. If the file is later truncated, it will probably keep the old block size. So, based on the history of the file, it can have multiple possible block sizes, and given a file size it is not always obvious why it has a particular physical size.

Concrete example: on my Solaris box, with a ZFS file system, I can create a very short file:

$ echo foo > test
$ stat test
Size: 4 Blocks: 2 IO Block: 512 regular file
(irrelevant details omitted)

OK, small file, 2 blocks, physical disk usage is 1024 for this file.

$ dd if=/dev/zero of=test2 bs=8192 count=4
$ stat test2
Size: 32768 Blocks: 65 IO Block: 32768 regular file

OK, now we see physical disk usage of 32.5K, and an IO block size of 32K. I then copied it to test3 and truncated this test3 file in an editor:

$ cp test2 test3
$ joe -hex test3
$ stat test3
Size: 4 Blocks: 65 IO Block: 32768 regular file

Well now, here's a file with 4 bytes in it - just like test - but it's using 32.5K physically on the disk, because of the way the ZFS file system allocates space. Block sizes increase as the file gets larger, but they don't decrease when the file gets smaller. (And yes, this can lead to substantial wasted space depending on the kinds of files and file operations you do on ZFS, which is why it allows you to set the maximum block size on a per-filesystem basis, and change it dynamically.)

Hopefully, you can now appreciate that there isn't necessarily a simple relationship between file size and physical disk usage. Even in the above it's not clear why 32.5K bytes are needed to store a file that's exactly 32K in size - it appears that ZFS generally needs an extra 512 bytes of storage of its own. Perhaps it's using that storage for checksums, reference counts, transaction state - file system bookkeeping. By including these extras in the indicated physical file size, it seems like ZFS is trying not to mislead the user as to the physical costs of the file. That doesn't mean it's trivial to reverse-engineer the calculation without knowing intimate details about the underlying file system implementation.

AWS EBS block size

The actual, physical means of connection is over the AWS software-defined Ethernet LAN; EBS is essentially a SAN. The volumes are not physically attached to the instance, but they are physically within the same availability zone, and access is over the network.

If the instance is "EBS Optimized," there's a separate allocation of Ethernet bandwidth for communication between the instance and EBS. Otherwise, the same Ethernet connection that handles all of the IP traffic for the instance is also used by EBS.

The SSDs behind EBS gp2 volumes are 4KiB page-aligned.

See AWS re:Invent 2015 | (STG403) Amazon EBS: Designing for Performance beginning around 24:15 for this.

As explained in AWS re:Invent 2016: Deep Dive on Amazon Elastic Block Store (STG301), an EBS volume is not a physical volume. They're not handing you an SSD drive. An EBS volume is a logical volume that spans numerous distributed devices throughout the availability zone. (The blocks on the devices are also replicated within EBS within the availability zone to a second device.)

These factors should make it apparent that the performance of the actual SSDs is not an especially significant factor in the performance of EBS. EBS, by all appearances, allocates resources in proportion to what you're paying for the volume... which is of course directly proportional to the size of the volume as well as which feature set (volume type) you've selected.

16KiB is the nominal size of an I/O that EBS uses for establishing performance benchmarks for gp2. It probably has no other special significance, as it appears to be related as much or more to the processing resources that EBS allocates to your volume as to the media devices themselves -- EBS volumes live in storage clusters that have "resources" of their own (CPU, memory, network bandwidth, etc.) and 16KiB seems to be a nominal value related to some kind of resource allocation in the EBS infrastructure.

Note that the sc1 and st1 volumes use a very different nominal I/O size: 1 MiB. Obviously, that can't be related to anything about the physical storage device, so this lends credence to the conclusion that the 16KiB number for gp2 (and io1) is likewise an internal accounting figure rather than a property of the underlying hardware.

A gp2 volume can perform up to the lowest of several limits:

  • 160 MiB/second, depending on the connected instance type‡
  • The current number of instantaneous IOPS available to the volume, which is the highest of

    • 100 IOPS regardless of volume size
    • 3 IOPS per provisioned GiB of volume size
    • The IOPS credits available in your token bucket, capped at 3,000 IOPS
  • 10,000 IOPS per volume regardless of how large the volume is

‡Smaller instance types can't provide 160MiB/second of network bandwidth, anyway. For example, the r3.xlarge has only half a gigabit (500 Mbps) of network bandwidth, limiting your total traffic to EBS to approximately 62.5 MiB/sec, so you won't be able to push any more throughput to an EBS volume than this from an instance of that type. Unless you are using very large instances or very small volumes, the most likely constraint on your EBS performance is going to be the limits of the instance, not the limits of EBS.

You are capped at the first (lowest) threshold you hit in the list above. The impact of the nominal 16 KiB I/O size is this: if your I/Os are smaller than 16KiB, your maximum possible IOPS does not increase, and if they are larger, your maximum possible IOPS may decrease:

  • an I/O size of 4KiB will not improve performance, since the nominal size of an I/O for rate-limiting purposes is established at 16KiB, but
  • an I/O size of 4KiB is unlikely to meaningfully decrease performance with sequential I/Os, since sequential I/Os are internally combined for EBS's accounting purposes. So, if your instance were to make 4 × 4 KiB sequential I/O requests, EBS is likely to count that as 1 I/O anyway
  • an I/O size of 4KiB with extremely random I/Os would indeed not be combined, so it would theoretically perform poorly relative to the same number of 16KiB extremely random I/Os, but instinct and experience tell me this borders on academic and theoretical territory except perhaps in extremely rare cases. Inflating those small random I/Os to 16KiB could just as likely hurt as help, since they would use the same number of IOPS but transfer more unnecessary data across the wire.
  • if your I/Os are larger than 16KiB, your maximum IOPS will decrease if your disk bandwidth reaches the 160MiB/s threshold before reaching the IOPS threshold (see the worked example below).
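To make the interplay of these limits concrete, here is a back-of-the-envelope calculation for a hypothetical 1,000 GiB gp2 volume (an assumed size, chosen only for illustration):

# Limits for a hypothetical 1,000 GiB gp2 volume, using the figures above.
vol_gib=1000
iops=$(( vol_gib * 3 ))                     # 3 IOPS/GiB -> 3,000 IOPS (above the 100 floor, below the 10,000 cap)
echo "IOPS limit: $iops"
echo "throughput at 16 KiB I/Os: $(( iops * 16 / 1024 )) MiB/s"    # ~46 MiB/s, well under 160 MiB/s
# The 160 MiB/s cap only becomes the binding limit once the average I/O
# size exceeds roughly 160 MiB/s divided by 3,000 IOPS:
echo "crossover I/O size: $(( 160 * 1024 / iops )) KiB"            # ~54 KiB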

A final thought: EBS performs best under load, that is, when the volume's queue is kept full of requests. A single thread making a series of random I/Os, one at a time, will not keep the queue filled, and when the queue is not full you will not see the maximum possible performance.

See also Amazon EBS Volume Performance on Linux Instances for more discussion of EBS performance.

Determine the size of a block device

fdisk doesn't understand the partition layout used by my Mac running Linux, nor any other non-PC partition format. (Yes, there's mac-fdisk for old Mac partition tables, and gdisk for the newer GPT partition tables, but those aren't the only other partition layouts out there.)

Since the kernel already scanned the partition layouts when the block device came into service, why not ask it directly?


$ cat /proc/partitions
major minor  #blocks  name

   8       16  390711384 sdb
   8       17     514079 sdb1
   8       18  390194752 sdb2
   8       32  976762584 sdc
   8       33     514079 sdc1
   8       34  976245952 sdc2
   8        0  156290904 sda
   8        1     514079 sda1
   8        2  155774272 sda2
   8       48 1465138584 sdd
   8       49     514079 sdd1
   8       50 1464621952 sdd2
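The #blocks column is in 1 KiB units, so converting an entry to bytes is a quick calculation; blockdev (where available, usually as root) asks the kernel for the size directly. The device name below is just the sdb entry from the listing above:

# /proc/partitions reports sizes in 1 KiB blocks; convert sdb to bytes:
awk '$4 == "sdb" { print $3 * 1024 " bytes" }' /proc/partitions

# Cross-check: ask the kernel for the device size in bytes (may need root):
blockdev --getsize64 /dev/sdb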

How to calculate space for number of records

A few observations on your approach.

First, since you're dealing with records of variable length, it would be useful to know the "average" record length, as that would allow a more accurate prediction of storage. Your approach assumes a worst-case scenario of every record being at maximum length, which is fine for planning purposes, but in reality the actual allocation will likely be lower if the average record length is below the maximum.

The approach you are taking is reasonable but consider that you can inform z/OS of the space requirements in blocks, records, DASD geometry or let DFSMS perform the calculation on your behalf. Refer to this article to get some additional information on options.

Back to your calculations:

Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of records at full length that can be stored in the block. If your average record length is less, then you can store more records per block.

The assumption of two blocks per track may be true for your situation, but it depends on the actual device type that will be used for the underlying allocation. Here is a link to the geometries of supported DASD devices.

[Table of DASD device geometries]

Your assumption of two blocks per track depends on the device, and it is not correct for 3390s: you would need 64k for two blocks on a track, but as you can see the 3390s max out at 56k, so you would only get one block per track on the device.

Also, it looks like you did factor in the RDW by adding 4 bytes, but someone looking at the question might be confused if they are not familiar with V records on z/OS. In the case of your calculation, that would be 61 records per block at 27998 (which is the "optimal block length", so two blocks can fit comfortably on a track).

I'll use the following values:

  • MaximumRecordLength = RecordLength + 4 for RDW

  • TotalRecords = Total Records at Maximum Length (worst case)

  • BlockSize = modeled blocksize

  • RecordsPerBlock = number of records that can fit in a block (worst case)

  • BlocksNeeded = number of blocks needed to contain estimated records (worst case)

  • BlocksPerTrack = from IBM device geometry information

  • TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack

  • CylindersNeeded = TracksNeeded / tracks per cylinder (15 for most devices)

Example 1:

Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 * 1 = 717
Cylinders = 717 / 15 = 47.8 (48)

Example 2:

Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
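For completeness, here is a small script that mechanizes the arithmetic above (a sketch only; it assumes the same worst-case, maximum-length records of 449 bytes including the RDW, and the same device figures used in the two examples):

#!/bin/sh
# Worst-case space estimate for a variable-length data set, following the
# formulas above. Arguments: total records, max record length (incl. RDW),
# block size, blocks per track, tracks per cylinder.
estimate() {
    awk -v recs="$1" -v lrecl="$2" -v blksize="$3" -v bpt="$4" -v tpc="$5" 'BEGIN {
        rpb    = int(blksize / lrecl)             # records per block
        blocks = int((recs + rpb - 1) / rpb)      # round up
        tracks = int((blocks + bpt - 1) / bpt)    # round up
        cyls   = int((tracks + tpc - 1) / tpc)    # round up
        printf "blocks=%d tracks=%d cylinders=%d\n", blocks, tracks, cyls
    }'
}

estimate 51560  449 32760 1 15    # Example 1 -> blocks=717  tracks=717  cylinders=48
estimate 127252 449 27998 2 15    # Example 2 -> blocks=2053 tracks=1027 cylinders=69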

Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was in JCL, you could use the RLSE subparameter of SPACE= to release space when the data set is created and closed. This should release unused resources.

Given that the records are variable-length, the estimates are worst case, and you would need to know more about the average record lengths to understand the actual allocation in terms of space used.

Final thought: all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe that most people today would specify BLKSIZE=0 and let DFSMS do all of the hard work, because that component has more information about where a file will go, what the underlying devices are, and the most efficient way of doing the allocation. The days of worrying about disk geometry and allocation are more of a campfire story, unless your environment has not been administered to do these things for you.


