Saving a Data Frame as a Binary File

Your best bet is to use rda files. You can use the save() and load() commands to write and read:

set.seed(101)
a = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

save(a, file="test.rda")
load("test.rda")

Edit: For completeness, just to cover what Harlan's suggestion might look like (i.e. wrapping the load command to return the data frame):

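loadx <- function(name, file) {
  # Load into a private environment so the caller's workspace is untouched
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env)  # look the object up by its name (a string)
}

a <- loadx("a", "test.rda")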

Alternatively, have a look at the hdf5, RNetCDF and ncdf packages. I've experimented with the hdf5 package in the past; this uses the NCSA HDF5 library. It's very simple:

hdf5save(fileout, ...)
hdf5load(file, load = TRUE, verbosity = 0, tidy = FALSE)

A last option is to use binary file connections, but that won't work well in your case because readBin and writeBin only support vectors.

Here's a trivial example. First write some data, opening the connection in mode "wb" ("w" for write, with "b" appended for binary):

zz <- file("testbin", "wb")
writeBin(1:10, zz)
close(zz)

Then read the data back, opening the connection in mode "rb" ("r" for read, with "b" appended for binary):

zz <- file("testbin", "rb")
readBin(zz, integer(), 4)  # reads the first 4 of the 10 integers written
close(zz)

How do I write a pandas dataframe to a binary file with specific formatting for multiple datatypes?

It turns out numpy arrays can only have one datatype, so it was trying to apply each datatype to each value -- hence the 4x4 array -- when I did .to_numpy(datatype). It was then writing that 4x4 array, resulting in the extra bytes.

Since pandas dataframes are based on numpy arrays anyway, it seems the answer is to specify the datatype on reading from CSV, then get the records from the dataframe and write those to binary.

import numpy as np
import pandas as pd

inputfilename = r"test_csv.csv"

datatype = np.dtype([
('val1', '>u4'),
('val2', '>u2'),
('val3', 'u1'),
('val4', '>f4')])

df = pd.read_csv(inputfilename,dtype=datatype)

dataonly = df.to_records(index=False)

outputfilename = r"output_py_1.dat"
with open(outputfilename, mode='wb') as fileobj:
    dataonly.tofile(fileobj)

Edit: One more note -- if the data resists being labeled as big endian:

import sys
if sys.byteorder == 'little':
    dataonly = dataonly.byteswap()
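
As a cross-check that needs only the standard library, the record layout above can be sketched with the struct module. This is an illustration, not part of the original answer: it assumes the four fields map to the struct format ">IHBf" (big-endian, no padding), and the helper names and sample values are hypothetical. Packing a couple of rows this way shows the expected size of 11 bytes per record, which can be compared against the file written by tofile.

```python
import struct

# Assumed mapping of the dtype above onto struct codes:
# '>u4' -> I, '>u2' -> H, 'u1' -> B, '>f4' -> f.
# The leading '>' selects big-endian byte order and disables padding.
RECORD = struct.Struct(">IHBf")


def pack_rows(rows):
    """Pack an iterable of (val1, val2, val3, val4) tuples into bytes."""
    return b"".join(RECORD.pack(*row) for row in rows)


def unpack_rows(data):
    """Inverse of pack_rows: decode bytes back into a list of tuples."""
    return [tuple(rec) for rec in RECORD.iter_unpack(data)]


# Made-up sample values, chosen so the floats are exactly representable.
rows = [(1, 2, 3, 4.0), (70000, 500, 255, 0.5)]
payload = pack_rows(rows)

print(RECORD.size)   # bytes per record
print(payload.hex())  # raw bytes, useful for comparing with the .dat file
```

If the file produced by the numpy approach has a different length than RECORD.size times the number of rows, the dtype was probably not applied per column.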

How to convert pandas dataframe to binary file in python

You can convert your data frame values to int16 by using the astype function.

import numpy as np

df = df.astype(np.int16)

Then you can save your data frame in HDF5 format by using to_hdf.

df.to_hdf('tmp.hdf', key='df', mode='w')

How to reversibly store and load a Pandas dataframe to/from disk

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:

import pandas as pd
store = pd.HDFStore('store.h5')

store['df'] = df # save it
store['df'] # load it

More advanced strategies are discussed in the cookbook.


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have Python object/text-heavy data (see this question). Note that to_msgpack was deprecated in pandas 0.25 and removed in 1.0.

How to keep hdf5 binary of a pandas dataframe in-memory?

The fix was to do conda install -c conda-forge pytables instead of pip install pytables. I still don't understand the ultimate reason behind the error, though.

How to store `pandas.DataFrame` in a PANDAS-LOADABLE binary format other than `pickle`

I would guess that your data frame is too big; pickle has some limits. You are much better off either saving to a database or using to_hdf (or one of the many other IO routines; to_msgpack might work as well).

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html

Python - Write in text mode to file opened in binary mode

This is quite a broad question, so just briefly: it is mostly about line endings. Apart from text encoding (str versus bytes in Python 3), that is basically the only distinction between binary and text modes.

  • If you "open" a file in the binary mode, all data are written exactly as they are. If you open a file in the text mode, newlines (\n) are converted according to the newline parameter.
  • I do not think that pandas needs the file to be opened in text mode. If you open the file in binary mode, then whatever pandas writes will end up physically in the file. See the line_terminator parameter of DataFrame.to_csv (renamed lineterminator in pandas 1.5).
  • It's mostly the same with FTP. If you use storbinary, the file will be uploaded as is. If you use storlines, you let the FTP server convert the line endings.
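
The newline behaviour described in the list above can be seen with a small standard-library sketch. The file name, helper names, and sample text are made up for illustration. It writes the same two lines in text mode with newline="\r\n", in text mode with newline="" (no translation), and in binary mode, then reads the raw bytes back. The last part wraps a binary stream in io.TextIOWrapper, which is one way to write text to something that was opened in binary mode.

```python
import io
import os
import tempfile


def write_text(path, text, newline):
    """Write text in text mode; '\n' is translated according to `newline`."""
    with open(path, "w", newline=newline, encoding="utf-8") as f:
        f.write(text)


def read_bytes(path):
    """Return the file's raw bytes, with no newline translation."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    write_text(path, "a\nb\n", newline="\r\n")  # each '\n' becomes '\r\n'
    crlf_bytes = read_bytes(path)

    write_text(path, "a\nb\n", newline="")      # "" disables translation
    raw_bytes = read_bytes(path)

    with open(path, "wb") as f:                 # binary mode: bytes as given
        f.write(b"a\nb\n")
    binary_bytes = read_bytes(path)

# Text-mode writing on top of a binary stream.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="utf-8", newline="\r\n")
wrapper.write("a\nb\n")
wrapper.flush()
wrapped_bytes = buffer.getvalue()
wrapper.detach()  # leave the underlying binary stream open

print(crlf_bytes, raw_bytes, binary_bytes, wrapped_bytes)
```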

