best way to preserve numpy arrays on disk
I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:
http://www.pytables.org/
http://www.h5py.org/
Both are designed to work with numpy arrays efficiently.
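A minimal h5py sketch of saving and reloading an array (the file and dataset names 'my_data.h5' / 'arr' are illustrative; PyTables offers an equivalent API):

```python
import numpy as np
import h5py

a = np.arange(12).reshape(3, 4)

# Write the array as a dataset inside an HDF5 file.
with h5py.File('my_data.h5', 'w') as f:
    f.create_dataset('arr', data=a)

# Read it back as a plain numpy array.
with h5py.File('my_data.h5', 'r') as f:
    b = f['arr'][:]

print(np.array_equal(a, b))   # True
```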
How to save and load numpy.array() data properly?
The most reliable way I have found to do this is to use np.savetxt with np.loadtxt, and not np.fromfile, which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods write and read binary files, whereas np.savetxt writes a text file.
So, for example:
a = np.array([1, 2, 3, 4])
np.savetxt('test1.txt', a, fmt='%d')
b = np.loadtxt('test1.txt', dtype=int)
a == b
# array([ True, True, True, True], dtype=bool)
Or:
a.tofile('test2.dat')
c = np.fromfile('test2.dat', dtype=int)
c == a
# array([ True, True, True, True], dtype=bool)
I use the former method even though it is slower and (sometimes) creates bigger files: the binary format can be platform dependent (for example, the file format depends on the endianness of your system).
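A quick sketch of why the binary route is fragile ('test_raw.dat' is just an example name): tofile writes raw bytes with no dtype or shape header, so fromfile has to be told the dtype, and a wrong guess silently produces garbage:

```python
import numpy as np

a = np.array([1.5, 2.5, 3.5])        # a float64 array
a.tofile('test_raw.dat')             # raw bytes: no dtype or shape stored

# Reading back with the wrong dtype silently reinterprets the bytes:
wrong = np.fromfile('test_raw.dat', dtype=np.int64)

# You must remember the original dtype yourself:
right = np.fromfile('test_raw.dat', dtype=np.float64)
print(np.array_equal(a, right))      # True
print(np.array_equal(a, wrong))      # False
```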
There is a platform independent format for NumPy arrays, which can be saved and read with np.save
and np.load
:
np.save('test3.npy', a) # .npy extension is added if not given
d = np.load('test3.npy')
a == d
# array([ True, True, True, True], dtype=bool)
Fastest save and load options for a numpy array
For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:
- NumPy.memmap, maps big arrays to binary form
  - Pros:
    - No dependency other than NumPy
    - Transparent replacement of ndarray (any class accepting an ndarray accepts a memmap)
  - Cons:
    - Chunks of your array are limited to 2.5 GB
    - Still limited by NumPy throughput
- Python bindings for HDF5, a big-data-ready file format, like PyTables or h5py
  - Pros:
    - The format supports compression, indexing, and other super nice features
    - Apparently the ultimate petabyte-scale file format
  - Cons:
    - Learning curve of having a hierarchical format?
    - Have to define what your performance needs are (see later)
- Python's pickling system (out of the race; mentioned for Pythonicity rather than speed)
  - Pros:
    - It's Pythonic! (haha)
    - Supports all sorts of objects
  - Cons:
    - Probably slower than the others (because it's aimed at any objects, not arrays)
Numpy.memmap
From the docs of NumPy.memmap :
Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory
The memmap object can be used anywhere an ndarray is accepted: given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
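A minimal sketch of that usage (the file name 'big.dat' is illustrative):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Create the file and write through the mapping.
mm = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=a.shape)
mm[:] = a
mm.flush()
del mm

# Re-open read-only: only the pages you actually touch are read from disk.
ro = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(1_000_000,))
print(isinstance(ro, np.ndarray))   # True, as the docs state
print(ro[123_456])                  # 123456.0
```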
HDF5 arrays
From the h5py doc
Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
The format supports compressing the data in various ways (more bits loaded for the same I/O read). This makes the data less easy to query individually, but in your case (purely loading/dumping arrays) it might be efficient.
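For example, a chunked, gzip-compressed dataset in h5py might look like this (the file/dataset names and chunk size are illustrative); slicing reads and decompresses only the chunks overlapping the slice:

```python
import numpy as np
import h5py

data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('arr', data=data, chunks=(100, 100),
                     compression='gzip', compression_opts=4)

with h5py.File('compressed.h5', 'r') as f:
    block = f['arr'][:100, :100]   # touches only the first chunk

print(np.array_equal(block, data[:100, :100]))   # True
```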
Is it possible to save boolean numpy arrays on disk as 1bit per element with memmap support?
numpy does not support arrays with 1 bit per element, and I doubt memmap has such a feature.
However, there is a simple workaround using packbits.
Since your case does not require bitwise random access, you can read the data as a 1-byte-per-element array.
import numpy as np

# A binary mask represented as a 1-byte-per-element array.
full_size_mask = np.random.randint(0, 2, size=[1920, 1080], dtype=np.uint8)
# Pack mask vertically.
packed_mask = np.packbits(full_size_mask, axis=0)
# Save as a memmap compatible file.
buffer = np.memmap("./temp.bin", mode='w+',
dtype=packed_mask.dtype, shape=packed_mask.shape)
buffer[:] = packed_mask
buffer.flush()
del buffer
# Open as a memmap file.
packed_mask = np.memmap("./temp.bin", mode='r',
dtype=packed_mask.dtype, shape=packed_mask.shape)
# Rect where you want to crop.
top = 555
left = 777
width = 256
height = 256
# Read the area containing the rect.
packed_top = top // 8
packed_bottom = (top + height) // 8 + 1
packed_patch = packed_mask[packed_top:packed_bottom, left:left + width]
# Unpack and crop the actual area.
patch_top = top - packed_top * 8
patch_mask = np.unpackbits(packed_patch, axis=0)[patch_top:patch_top + height]
# Check that the mask is cropped from the correct area.
print(np.all(patch_mask == full_size_mask[top:top + height, left:left + width]))
Note that this solution could (and likely will) read extra bits.
To be specific, 7 bits maximum at both ends.
In your case, it will be 7x2x256 bits, but this is only about 5% of the patch, so I believe it is negligible.
By the way, this is not an answer to your question, but when you are dealing with binary masks such as labels for image segmentation, compressing with zip may drastically reduce the file size.
It is possible that it could be reduced to less than 8 KB per image (not per patch).
You might want to consider this option as well.
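A rough sketch of that effect with zlib (the mask shape and rectangle are made up; real segmentation labels with large uniform regions compress similarly well):

```python
import zlib
import numpy as np

# A structured mask with large uniform regions, like typical segmentation labels.
mask = np.zeros((1920, 1080), dtype=np.uint8)
mask[500:900, 200:700] = 1

packed = np.packbits(mask, axis=0)               # 240 x 1080 = 259200 bytes
compressed = zlib.compress(packed.tobytes(), 9)

print(packed.nbytes, len(compressed))            # the zipped version is far smaller
```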
Saving a numpy array in binary does not improve disk usage compared to uint8
The uint8 and bool data types both occupy one byte of memory per element, so arrays of equal dimensions will always occupy the same amount of memory. If you are aiming to reduce your memory footprint, you can pack the boolean values as bits into a uint8 array using numpy.packbits, thereby storing the binary data in a significantly smaller array.
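A small sketch of the packing round trip (the 1000-element array is arbitrary):

```python
import numpy as np

mask = np.random.rand(1000) > 0.5        # bool array: 1 byte per element
packed = np.packbits(mask)               # 8 elements per byte

print(mask.nbytes, packed.nbytes)        # 1000 vs 125

# Unpack and trim the trailing pad bits to recover the original:
restored = np.unpackbits(packed)[:mask.size].astype(bool)
print(np.array_equal(mask, restored))    # True
```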
How to save to disk several 2D np.arrays each having the second dimension variable in size?
You can use the npz format with keyword arguments to identify each array, which avoids the problem of the order not being preserved.
a = np.arange(10)
b = np.arange(5)*-1
f1={'my_key1':a}
f2={'my_key2':b}
np.savez_compressed('my_archive.npz', **f1, **f2)
Read your archive as follows:
>>> print(np.load('my_archive.npz')['my_key2'])
[ 0 -1 -2 -3 -4]
Edit
As mentioned in the comments, the solution presented above is not practical if we want to register N arrays automatically.
The solution is therefore to create a single dictionary with N entries:
f = {}
f['my_key1'] = a
f['my_key2'] = b
np.savez_compressed('my_archive.npz', **f)
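For example, the dictionary can be built programmatically for N arrays whose second dimension varies (the 'my_keyN' key pattern is just an illustration):

```python
import numpy as np

arrays = [np.zeros((3, n)) for n in (2, 5, 7)]   # variable second dimension
f = {f'my_key{i}': arr for i, arr in enumerate(arrays)}
np.savez_compressed('my_archive.npz', **f)

loaded = np.load('my_archive.npz')
print(all(np.array_equal(loaded[k], f[k]) for k in f))   # True
```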