Saving a data frame as a binary file
Your best bet is to use rda files. You can use the save() and load() commands to write and read:
set.seed(101)
a = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))
save(a, file="test.rda")
load("test.rda")
Edit: For completeness, just to cover what Harlan's suggestion might look like (i.e. wrapping the load command to return the data frame):
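# Load into a private environment and return the object stored under `name`,
# so the result comes from the file rather than from the caller's workspace.
loadx <- function(name, file) {
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env)
}
a2 <- loadx("a", "test.rda")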
Alternatively, have a look at the hdf5, RNetCDF and ncdf packages. I've experimented with the hdf5 package in the past; this uses the NCSA HDF5 library. It's very simple:
hdf5save(fileout, ...)
hdf5load(file, load = TRUE, verbosity = 0, tidy = FALSE)
A last option is to use binary file connections, but that won't work well in your case because readBin and writeBin only support vectors.
Here's a trivial example. First write some data, opening the connection in "w" mode with "b" appended to the mode string for binary:
zz <- file("testbin", "wb")
writeBin(1:10, zz)
close(zz)
Then read the data back, opening the connection in "r" mode with "b" appended to the mode string:
zz <- file("testbin", "rb")
readBin(zz, integer(), 4)
close(zz)
How do I write a pandas dataframe to a binary file with specific formatting for multiple datatypes?
It turns out numpy arrays can only have one datatype, so it was trying to apply each datatype to each value -- hence the 4x4 array -- when I did .to_numpy(datatype). It was then writing that 4x4 array, resulting in the extra bytes.
Since pandas dataframes are based on numpy arrays anyway, it seems the answer is to specify the datatype on reading from CSV, then get the records from the dataframe and write those to binary.
import numpy as np
import pandas as pd
inputfilename = r"test_csv.csv"
datatype = np.dtype([
('val1', '>u4'),
('val2', '>u2'),
('val3', 'u1'),
('val4', '>f4')])
df = pd.read_csv(inputfilename,dtype=datatype)
dataonly = df.to_records(index=False)
outputfilename = r"output_py_1.dat"
# The context manager closes the file even if the write fails
with open(outputfilename, mode='wb') as fileobj:
    dataonly.tofile(fileobj)
Edit: One more note -- if the data resists being labeled as big endian:
import sys
if sys.byteorder == 'little':
    dataonly = dataonly.byteswap()
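As a way to check the byte layout without relying on how the CSV reader handles the dtype, here is a minimal sketch that fills a structured array field by field and inspects the resulting bytes. The small data frame and the names record_dtype, records and payload are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical four-column frame standing in for the CSV contents.
df = pd.DataFrame({
    "val1": [1, 2],
    "val2": [3, 4],
    "val3": [5, 6],
    "val4": [1.5, 2.5],
})

# Same field layout as above: big-endian u4, u2, f4 plus a single byte.
record_dtype = np.dtype([
    ("val1", ">u4"),
    ("val2", ">u2"),
    ("val3", "u1"),
    ("val4", ">f4"),
])

# Fill one field at a time; NumPy casts each column to the field's type.
records = np.empty(len(df), dtype=record_dtype)
for name in record_dtype.names:
    records[name] = df[name].to_numpy()

# tobytes() yields the same bytes that tofile() would write to disk.
payload = records.tobytes()

# Reading back with the same dtype recovers the original values.
roundtrip = np.frombuffer(payload, dtype=record_dtype)
```

Each record occupies 4 + 2 + 1 + 4 = 11 bytes, so the payload length should equal 11 times the number of rows. Because the byte order is part of the field dtype here, no separate byteswap step is needed in this variant.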
How to convert pandas dataframe to binary file in python
You can convert your data frame values to int16 by using the astype function.
import numpy as np
df = df.astype(np.int16)
Then you can save your data frame in HDF5 format by using to_hdf.
df.to_hdf('tmp.hdf', key='df', mode='w')
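Before the astype call, it may be worth confirming that every value fits in int16, since an out-of-range value would not survive the cast intact. The sketch below does that check and, as an alternative to HDF5, dumps the values as raw bytes. The sample data and the names limits, fits, raw and restored are assumptions for illustration, and a raw dump like this stores neither the shape nor the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical integer data; replace with the real frame.
df = pd.DataFrame({"a": [1, -2, 300], "b": [4, 5, -600]})

# int16 can hold values from -32768 to 32767.
limits = np.iinfo(np.int16)
fits = bool(((df >= limits.min) & (df <= limits.max)).all().all())

if fits:
    small = df.astype(np.int16)
    # Row-major bytes in native byte order, 2 bytes per value.
    raw = small.to_numpy().tobytes()
    # The reader must already know the dtype and shape to rebuild the array.
    restored = np.frombuffer(raw, dtype=np.int16).reshape(small.shape)
```

The byte string can be written with an ordinary file opened in "wb" mode. Anyone reading it back has to be told the dtype, the shape and the byte order separately, which is the main reason a self-describing format such as HDF5 is usually the more comfortable choice.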
How to reversibly store and load a Pandas dataframe to/from disk
The easiest way is to pickle it using to_pickle:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
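To see that the pickle round trip is reversible, here is a small self-contained sketch that writes a frame to a temporary directory and compares the reloaded copy with the original. The sample frame and the file name frame.pkl are placeholders chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with mixed column types.
df = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "frame.pkl")
    df.to_pickle(path)               # write the frame to disk
    restored = pd.read_pickle(path)  # read it back into a new object

# equals() compares values; the dtype check confirms column types survived.
same_values = restored.equals(df)
same_dtypes = bool((restored.dtypes == df.dtypes).all())
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.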
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
import pandas as pd
store = pd.HDFStore('store.h5')
store['df'] = df # save it
store['df'] # load it
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have Python object/text-heavy data (see this question). Note that to_msgpack and read_msgpack were deprecated in pandas 0.25 and removed in 1.0.
How to keep hdf5 binary of a pandas dataframe in-memory?
The fix was to do conda install -c conda-forge pytables instead of pip install pytables. I still don't understand the ultimate reason behind the error, though.
How to store `pandas.DataFrame` in a PANDAS-LOADABLE binary format other than `pickle`
I would guess that your data frame is too big. Pickle has some limits. You are much better off either saving in a database or using to_hdf (or one of the many other IO routines; to_msgpack might work as well).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html
Python - Write in text mode to file opened in binary mode
This is quite a broad question, so just briefly: it is mostly about line endings. That's basically the only distinction between the binary and text modes.
- If you open a file in binary mode, all data are written exactly as they are. If you open a file in text mode, newlines (\n) are converted according to the newline parameter.
- I do not think that Pandas needs the file to be opened in text mode. If you open the file in binary mode, then whatever Pandas writes will end up physically in the file. See the line_terminator parameter of DataFrame.to_csv (renamed lineterminator in pandas 1.5).
- It's mostly the same with FTP. If you use storbinary, the file will be uploaded as is. If you use storlines, you let the FTP server convert the line endings.
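The newline conversion described in the first point can be observed directly with plain file objects. This is a minimal sketch using temporary files; the file names and the sample string are arbitrary choices for illustration.

```python
import os
import tempfile

text = "a\nb\n"

with tempfile.TemporaryDirectory() as tmpdir:
    text_path = os.path.join(tmpdir, "text_mode.txt")
    binary_path = os.path.join(tmpdir, "binary_mode.txt")

    # Text mode: each "\n" written is translated to the requested newline.
    with open(text_path, "w", newline="\r\n") as fh:
        fh.write(text)

    # Binary mode: the bytes are stored exactly as given.
    with open(binary_path, "wb") as fh:
        fh.write(text.encode("ascii"))

    # Read both files back as raw bytes to compare what reached the disk.
    with open(text_path, "rb") as fh:
        text_mode_bytes = fh.read()
    with open(binary_path, "rb") as fh:
        binary_mode_bytes = fh.read()
```

With newline="\r\n" the text-mode file ends up two bytes longer than the binary-mode one, because each of the two line feeds gained a carriage return. Passing newline="" instead would disable the translation in text mode.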