How to Read HDF5 Files in Python

Read HDF5

import h5py

filename = "file.hdf5"

with h5py.File(filename, "r") as f:
    # Print all root level object names (aka keys);
    # these can be group or dataset names
    print("Keys: %s" % f.keys())
    # get first object name/key; may or may NOT be a group
    a_group_key = list(f.keys())[0]

    # get the object type for a_group_key: usually group or dataset
    print(type(f[a_group_key]))

    # If a_group_key is a group name,
    # this gets the object names in the group and returns them as a list
    data = list(f[a_group_key])

    # If a_group_key is a dataset name,
    # this gets the dataset values and returns them as a list
    data = list(f[a_group_key])
    # preferred methods to get dataset values:
    ds_obj = f[a_group_key]  # returns an h5py dataset object
    ds_arr = f[a_group_key][()]  # returns a numpy array
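
Because the first key may be either a group or a dataset, it can help to branch on the object type before reading. The following is a minimal sketch, not part of the original answer: the helper name `describe`, the file name `demo.hdf5`, and the keys `values` and `results` are made up for illustration, and the file is created on the fly so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def describe(obj):
    """Return a short description of an h5py group or dataset (hypothetical helper)."""
    if isinstance(obj, h5py.Dataset):
        return f"dataset shape={obj.shape} dtype={obj.dtype}"
    if isinstance(obj, h5py.Group):
        return f"group members={sorted(obj.keys())}"
    return f"other: {type(obj).__name__}"


# Build a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("values", data=np.arange(6).reshape(2, 3))
    f.create_group("results").create_dataset("score", data=np.array([0.5]))

# Describe every root-level object without assuming what kind it is.
with h5py.File(path, "r") as f:
    descriptions = {key: describe(f[key]) for key in f.keys()}

for key, text in descriptions.items():
    print(key, "->", text)
```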

Write HDF5

import h5py

# Create random data
import numpy as np
data_matrix = np.random.uniform(-1, 1, size=(10, 3))

# Write data to HDF5
with h5py.File("file.hdf5", "w") as data_file:
    data_file.create_dataset("dataset_name", data=data_matrix)

See h5py docs for more information.
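
To check that a write worked, you can read the dataset back and compare it with the original array. The sketch below is an illustration under assumptions rather than part of the answer above: it writes to a temporary path, and the gzip compression and the `description` attribute are optional extras chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Seeded generator so the example is reproducible.
rng = np.random.default_rng(seed=0)
data_matrix = rng.uniform(-1, 1, size=(10, 3))

path = os.path.join(tempfile.mkdtemp(), "roundtrip.hdf5")

# Write the array, optionally compressed, and attach a small piece of metadata.
with h5py.File(path, "w") as data_file:
    dset = data_file.create_dataset(
        "dataset_name", data=data_matrix, compression="gzip"
    )
    dset.attrs["description"] = "uniform random values"

# Read it back: a slice returns part of the data, [()] returns all of it.
with h5py.File(path, "r") as data_file:
    dset = data_file["dataset_name"]
    first_rows = dset[:2]
    restored = dset[()]
    description = dset.attrs["description"]

print(first_rows.shape, restored.shape, description)
```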

Alternatives

  • JSON: Nice for writing human-readable data; VERY commonly used (read & write)
  • CSV: Super simple format (read & write)
  • pickle: A Python serialization format (read & write)
  • MessagePack (Python package): More compact representation (read & write)
  • HDF5 (Python package): Nice for matrices (read & write)
  • XML: exists too *sigh* (read & write)

For your application, the following might be important:

  • Support by other programming languages
  • Reading / writing performance
  • Compactness (file size)

See also: Comparison of data serialization formats

If you are instead looking for a way to create configuration files, you might want to read my short article Configuration files in Python.

Dask: Read hdf5 and write to other hdf5 file

For anyone interested, I created a workaround which simply calls compute() on each block. Just sharing it, although I'm still interested in a better solution.

from itertools import product

import h5py as h5


def to_hdf5(x, filename, datapath):
    """
    Appends dask array to hdf5 file
    """
    with h5.File(filename, "a") as f:
        dset = f.require_dataset(datapath, shape=x.shape, dtype=x.dtype)

        for block_ids in product(*[range(num) for num in x.numblocks]):
            # Start offset of this block along each dimension
            pos = [sum(x.chunks[dim][0 : block_ids[dim]]) for dim in range(len(block_ids))]
            block = x.blocks[block_ids]
            slices = tuple(slice(pos[i], pos[i] + block.shape[i]) for i in range(len(block_ids)))
            # Compute one block at a time and write it into its slot
            dset[slices] = block.compute()
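
The offset arithmetic in the workaround can be checked without dask or h5py. This is a small sketch of the same idea in plain Python: the `chunks` tuple is a made-up layout in the same form as a dask array's `chunks` attribute, and each block's start offset is the sum of the chunk sizes that come before it along that dimension.

```python
from itertools import product

# Hypothetical chunk layout for a 5 x 6 array: rows split 2+2+1, columns 3+3.
chunks = ((2, 2, 1), (3, 3))
numblocks = tuple(len(dim_chunks) for dim_chunks in chunks)

block_slices = {}
for block_ids in product(*[range(num) for num in numblocks]):
    ndim = len(block_ids)
    # Start offset: total size of all earlier chunks along each dimension.
    start = [sum(chunks[dim][: block_ids[dim]]) for dim in range(ndim)]
    # Extent of this block along each dimension.
    size = [chunks[dim][block_ids[dim]] for dim in range(ndim)]
    block_slices[block_ids] = tuple(slice(s, s + n) for s, n in zip(start, size))

print(block_slices[(1, 1)])  # (slice(2, 4, None), slice(3, 6, None))
```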

How can I read HDF5 files and plot them as images

HDF5 is a container of arbitrary data organized into groups and datasets (aka the data schema). To work with the data effectively, you need to understand the schema before you start coding. Ideally, the data source provides the schema. If not, your first step is deducing it. You can do this by opening the file and viewing it with HDFView (from the HDF Group), or by writing little code snippets as shown in the linked answer.
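
One way to deduce the schema in code is to walk the whole file and record every group and dataset. The sketch below uses h5py's `visititems` for that; the function name `list_schema` is made up, and the stand-in file (a group `img_00000` holding a `block0_values` dataset of zeros) is created only so the example runs by itself.

```python
import os
import tempfile

import h5py
import numpy as np


def list_schema(filename):
    """Return (path, kind, shape, dtype) for every object below the root."""
    entries = []

    def visitor(name, obj):
        # Called once per group and dataset; returning None continues the walk.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, "dataset", obj.shape, str(obj.dtype)))
        else:
            entries.append((name, "group", None, None))

    with h5py.File(filename, "r") as f:
        f.visititems(visitor)
    return entries


# Stand-in file for the demonstration; replace with your own path.
path = os.path.join(tempfile.mkdtemp(), "schema_demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_group("img_00000").create_dataset("block0_values", data=np.zeros((2, 6)))

entries = list_schema(path)
for entry in entries:
    print(entry)
```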

I looked at your file. You said you want to "see the images". You can't do that with this data. I read the file descriptions here: DeepMoon Supplemental Materials. There are 6 files of interest:

  • name_craters.hdf5 - Pandas HDFStore of crater locations and sizes for images in the dataset.
  • name_images.hdf5 - Input DEM images and output targets of the dataset, where:
    • name = dev for the validation dataset
    • name = test for the test dataset
    • name = train for the training dataset

So, if you want the training image data you need to download the train_images.hdf5 file. Warning: it is 9.9 GB.

Comments about the train_craters.hdf5 file:

This file was created by Pandas. The file has 30,000 groups, 1 for each image (named "img_xxxxx"). Each group has 4 datasets named: "axis_0", "axis_1", "block0_items", and "block0_values". They have data about each image, but not any image data. For example, both "axis_0" and "block0_items" have the following entries:

Diameter (km)
Lat
Long
x
y
Diameter (pix)

There is data in "block0_values". Here is an example from "img_00000/block0_values":

[[ 5.32341731 -35.10135397 -101.80962272 161.77188631 252.6564721 10.87213217]  
[ 5.38713978 -34.86402264 -102.38375512 132.62561605 237.8560143 11.00227398]]

From this you get:

Diameter (km)[0] = 5.32341731
Lat[0] = -35.10135397
Long[0] = -101.80962272
x[0] = 161.77188631
y[0] = 252.6564721
Diameter (pix)[0] = 10.87213217

Diameter (km)[1] = 5.38713978
Lat[1] = -34.86402264
Long[1] = -102.38375512
x[1] = 132.62561605
y[1] = 237.8560143
Diameter (pix)[1] = 11.00227398

So, that provides some basic info about each image...but not an array of pixel values you can convert into an image.
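
The pairing of column names with values shown above can be sketched in a few lines. The names and numbers below are copied from the "block0_items" and "block0_values" examples in this answer; the variable names are made up. Since the file is a Pandas HDFStore, `pandas.read_hdf` would likely return the same table directly, assuming PyTables is installed, but that is not shown here.

```python
# Column names as listed in "block0_items".
column_names = ["Diameter (km)", "Lat", "Long", "x", "y", "Diameter (pix)"]

# The two rows shown for "img_00000/block0_values".
block0_values = [
    [5.32341731, -35.10135397, -101.80962272, 161.77188631, 252.6564721, 10.87213217],
    [5.38713978, -34.86402264, -102.38375512, 132.62561605, 237.8560143, 11.00227398],
]

# One dictionary per crater, keyed by column name.
craters = [dict(zip(column_names, row)) for row in block0_values]

for index, crater in enumerate(craters):
    print(index, crater["Lat"], crater["Long"], crater["Diameter (km)"])
```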

Open .h5 file in Python

To open an HDF5 file with the h5py module, you can use h5py.File(filename). The documentation can be found here.

import h5py

filename = "vstoxx_data_31032014.h5"

h5 = h5py.File(filename,'r')

futures_data = h5['futures_data'] # VSTOXX futures data
options_data = h5['options_data'] # VSTOXX call option data

# Close the file when finished; the objects above cannot be read afterwards
h5.close()
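
An alternative to calling close() by hand is a `with` block, which closes the file automatically. The sketch below assumes a small stand-in file with a made-up dataset named `prices`, not the VSTOXX file from the answer. It copies the values into a NumPy array while the file is open so they remain usable afterwards.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as h5:
    h5.create_dataset("prices", data=np.array([17.5, 18.0, 18.25]))

# The file is closed automatically when the with block ends.
with h5py.File(path, "r") as h5:
    prices = h5["prices"][()]  # copy the values into memory while the file is open

# The NumPy array is independent of the file and still usable here.
print(prices.mean())
```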

