Optimal HDF5 dataset chunk shape for reading rows

Finding the right chunk cache size

First, I want to discuss some general points.
It is very important to know that each individual chunk can only be read or written as a whole. The default chunk-cache size of h5py, which helps avoid excessive disk I/O, is only 1 MB and should be increased in many cases, as discussed later on.
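As a point of reference, here is a minimal sketch of how the chunk cache can be enlarged when opening a file (h5py 2.9 and newer expose these cache parameters directly on h5py.File; the file name is just a placeholder):

import h5py

# rdcc_nbytes: chunk-cache size in bytes for each dataset opened through this
#              file handle (the default is 1 MB)
# rdcc_nslots: number of hash-table slots for the cache; the HDF5 docs suggest
#              a prime roughly 100x the number of chunks that fit in the cache
f = h5py.File('Test.h5', 'a', rdcc_nbytes=1024**2 * 4000, rdcc_nslots=int(1e7))
# ... work with the file ...
f.close()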

As an example:

  • We have a dataset with shape (639038, 10000), float32 (25.5 GB uncompressed)
  • We want to write our data column-wise with dset[:,i]=arr and read it row-wise with arr=dset[i,:]
  • We choose a completely wrong chunk shape for this type of work, i.e. (1, 10000)

In this case the reading speed won't be too bad (although the chunk size is a little small), because we only read the data we are actually using. But what happens when we write to that dataset? If we write a column, one floating-point number of each chunk is modified. This means we are actually writing the whole dataset (25.5 GB) with every iteration, and reading the whole dataset on top of that, because a chunk that is modified has to be read first if it is not already cached (I assume a chunk-cache size below 25.5 GB here).
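To make that cost concrete, a quick back-of-the-envelope calculation with the numbers from the example (plain illustrative arithmetic, not h5py code):

rows, cols = 639038, 10000         # dataset shape from the example, float32 (4 bytes)
chunk_rows, chunk_cols = 1, 10000  # the badly chosen chunk shape

# A column write dset[:, i] touches one value in every row, i.e. in every
# single chunk, and each modified chunk is read and written back as a whole.
chunks_touched = rows // chunk_rows              # 639038 chunks
bytes_per_chunk = chunk_rows * chunk_cols * 4    # 40 kB per chunk
print(chunks_touched * bytes_per_chunk / 1e9)    # ~25.6 GB moved per column write, i.e. the whole dataset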

So what can we improve here?
In such a case we have to make a compromise between write/read speed and the memory which is used by the chunk-cache.

An assumption which will give both decent read and write speed:

  • We choose a chunk shape of (100, 1000).
  • If we iterate over the first dimension (column access, dset[:,i]) we need at least 1000*639038*4 bytes -> 2.55 GB of cache to avoid the additional I/O overhead described above; iterating over the second dimension (row access, dset[i,:]) only needs 100*10000*4 bytes -> about 4 MB (see the sketch after this list).
  • So we should provide at least about 2.6 GB of chunk cache in this example.
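The cache requirements above can be checked with a small calculation (plain Python, reusing the example numbers):

rows, cols, itemsize = 639038, 10000, 4   # float32 dataset from the example
chunk_rows, chunk_cols = 100, 1000        # the compromise chunk shape

chunk_bytes = chunk_rows * chunk_cols * itemsize   # 400,000 bytes = 0.4 MB per chunk

# Column access dset[:, i]: all chunks along the first dimension stay in use
# while we sweep through the 1000 columns they cover.
chunks_per_column = -(-rows // chunk_rows)         # ceil(639038 / 100) = 6391
print(chunks_per_column * chunk_bytes / 1e9)       # ~2.56 GB of cache needed

# Row access dset[i, :]: only one row of chunks is touched at a time.
chunks_per_row = -(-cols // chunk_cols)            # 10
print(chunks_per_row * chunk_bytes / 1e6)          # ~4 MB of cache needed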

Conclusion
There is no generally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without thinking about the chunk cache. RAM is orders of magnitude faster than the fastest SSD when it comes to random reads and writes.

Regarding your problem
I would simply read the random rows directly; the improper chunk-cache size is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'

    # shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk cache memory here ("rdcc_nbytes")
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Reading random rows
    # If we read one row, 100 rows are actually read, but if we access a row
    # which is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds = np.random.randint(0, high=shape[0]-1, size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()

The simplest form of fancy slicing

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

def Writing():
    File_Name_HDF5 = 'Test.h5'

    # shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1: normal indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)

    f.close()
    print(time.time() - t1)

    # Writing_2: simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array

    f.close()
    print(time.time() - t1)

On my HDD this takes 34 seconds for the first version and 78 seconds for the second version.

How to optimize sequential writes with h5py to increase speed when reading the file afterwards?

Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.

The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) size of the I/O data block. h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.
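As a quick sanity check against that 10 KB to 1 MB guideline, the size of one chunk is just the product of the chunk shape and the item size. A small helper (assuming the float32 dtype used by the datasets in this answer) for the chunk shapes discussed below:

import numpy as np

def chunk_nbytes(chunk_shape, dtype=np.float32):
    """Size in bytes of one chunk for a given chunk shape and dtype."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

print(chunk_nbytes((40, 625)))      # 100,000 bytes (~0.1 MB)
print(chunk_nbytes((10, 15_625)))   # 625,000 bytes (~0.6 MB)
print(chunk_nbytes((10, 40_000)))   # 1,600,000 bytes (~1.5 MiB)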

So, my initial hunch was to investigate chunk size influence on I/O performance. Setting the optimal chunk size is a bit of an art. Best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
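For illustration, a minimal sketch of that behavior: defining maxshape without a chunks argument makes h5py enable chunking automatically, and the guessed chunk shape can be inspected through the dataset's .chunks attribute (the file name is just a placeholder):

import h5py
import numpy as np

with h5py.File('chunk_guess_demo.h5', 'w') as f:
    # No chunks= given: a resizable dataset (maxshape) forces chunked storage,
    # so h5py picks a chunk shape based on the initial shape and dtype.
    dset = f.create_dataset('data', shape=(5_000, 40_000),
                            maxshape=(5_000, None), dtype=np.float32)
    print(dset.chunks)   # the auto-chosen chunk shape, e.g. (40, 625) as reported below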

I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)

Note: I only created a 78GB file (4_000_000 columns). This takes >13mins to run on my Windows system. I didn't want to wait 90mins to create a 600GB file. You can modify n_blocks=750 if you want to test 30_000_000 columns. :-) All code at the end of this post.

Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:

Time to read first row: 0.28 (in sec)
Time to read last row: 0.28

Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.

I ran 3 tests (in all cases block_to_write.shape=(5_000, 40_000)):

  1. default chunksize=(40, 625) [95 KB]; for 5_000x40_000 dataset (resized)
  2. default chunksize=(10, 15625) [596 KB]; for 5_000x4_000_000 dataset (not resized)
  3. user-defined chunksize=(10, 40_000) [1.526 MB]; for 5_000x4_000_000 dataset (not resized)

Larger chunks improve read performance, but speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 tests below.

dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28

dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06

dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02

Code to create my test file below:

import time
import h5py
import numpy as np

fname = 'Test_write.h5'   # placeholder file name

with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # , chunks=(10, blocksize))
        else:
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)

        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')

print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')

Code to read 2 different arrays from the test file below:

# assumes the same imports and fname as the writer script above
with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0, :]
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1, :]
    print(f'Time to read last row: {time.time()-start:.2f}')

Compression performance related to chunk size in hdf5 files

Chunking doesn't really affect the compression ratio per se, except in the manner @Ümit describes. What chunking does do is affect the I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed - possibly involving a whole lot more I/O, depending on the size of the cache, shape of the chunk, etc.

What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
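For example, here is a minimal sketch (file name, dataset name, and sizes are illustrative) of a compressed dataset whose chunks are whole columns, so that reading a column only has to decompress a single chunk:

import h5py
import numpy as np

data = np.random.random((10_000, 200)).astype(np.float32)

with h5py.File('compressed_demo.h5', 'w') as f:
    # One chunk per column (10_000 x 1 -> 40 kB per chunk): a column read
    # decompresses exactly one chunk instead of many.
    f.create_dataset('data', data=data, chunks=(data.shape[0], 1),
                     compression='gzip', compression_opts=4)

with h5py.File('compressed_demo.h5', 'r') as f:
    col = f['data'][:, 42]   # touches (and decompresses) a single chunk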


