Adding Meta-Information/Metadata to Pandas Dataframe

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
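The loss is easy to reproduce. A minimal sketch follows; the column names and values are made up for illustration, and only the instrument_name attribute comes from the snippet above.

```python
import pandas as pd

# Illustrative data; not from the original question.
df = pd.DataFrame({'stock': ['AAPL', 'AAPL', 'MSFT'],
                   'price': [445., 455., 195.]})
df.instrument_name = 'Binky'

# groupby(...).mean() builds a brand-new DataFrame.
grouped = df.groupby('stock').mean()

print(hasattr(df, 'instrument_name'))       # True
print(hasattr(grouped, 'instrument_name'))  # False
```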

Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.

How to handle meta data associated with a pandas dataframe?

Although building a custom object is not your first choice, it might be your only feasible option, and has the significant advantage of being extremely flexible. Here's a really simple example:

df = pd.DataFrame({'stock': 'AAPL AAPL MSFT MSFT'.split(),
                   'price': [445., 455., 195., 205.]})

col_labels = {'stock': 'Ticker Symbol',
              'price': 'Closing Price in USD'}

That's just a dictionary of column labels, but often the majority of metadata is related to specific columns. Here's the sample data, with labels:

df.rename(columns=col_labels)

#   Ticker Symbol  Closing Price in USD
# 0          AAPL                 445.0
# 1          AAPL                 455.0
# 2          MSFT                 195.0
# 3          MSFT                 205.0

The nice thing is that the labels "persist" in the sense that you can basically apply them to any data whose columns are a subset or superset of the original columns:

df.groupby('stock').mean().rename(columns=col_labels)

#        Closing Price in USD
# stock
# AAPL                  450.0
# MSFT                  200.0

You can get some limited persistence if you use the attrs attribute:

df.attrs = col_labels

But it's fairly limited. It will persist for DataFrames derived via .copy(), .loc[], or .iloc[], but not for the result of a groupby(). You can of course reattach it to any derivative DataFrame with, for example,

df2.attrs = df.attrs

But as noted in the documentation (or lack thereof), this is an experimental feature and subject to change. Seems slightly better than nothing, and maybe will be expanded in the future. I couldn't find much info at all regarding attrs, but it appears to be initialized as an empty dictionary, and can only be a dictionary (or similar) although of course lists could be nested below the top level.
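If you find yourself reattaching attrs often, a small helper can do it in one step. This is only a sketch: carry_attrs is a hypothetical name, not a pandas function, and it assumes a pandas version that has the attrs attribute.

```python
import pandas as pd


def carry_attrs(result, source):
    """Copy source.attrs onto result and return result (hypothetical helper)."""
    # Shallow-copy the dict so the two frames do not share one object.
    result.attrs = dict(source.attrs)
    return result


df = pd.DataFrame({'stock': 'AAPL AAPL MSFT MSFT'.split(),
                   'price': [445., 455., 195., 205.]})
df.attrs = {'stock': 'Ticker Symbol', 'price': 'Closing Price in USD'}

# Reattach the labels to the aggregated result explicitly.
means = carry_attrs(df.groupby('stock').mean(), df)
print(means.attrs)
```

Copying the dictionary rather than assigning it directly means later edits to one frame's attrs do not silently change the other's.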

How to add meta_data to Pandas dataframe?

This is not supported right now. See https://github.com/pydata/pandas/issues/2485. The reason is that propagating these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.

Save additional attributes in Pandas Dataframe

There is an open issue regarding the storage of custom metadata in NDFrames. But due to the multitudinous ways pandas functions may return DataFrames, the _metadata attribute is not (yet) preserved in all situations.

For the time being, you'll just have to store the metadata in an auxiliary variable.
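One way to keep that auxiliary variable from drifting away from its data is to bundle the two in a small container. The sketch below is an assumption about how you might structure it; AnnotatedFrame and its apply method are hypothetical names, not part of pandas.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(eq=False)  # eq=False: comparing DataFrames with == is ambiguous
class AnnotatedFrame:
    """Hypothetical container pairing a DataFrame with its metadata."""
    df: pd.DataFrame
    meta: dict = field(default_factory=dict)

    def apply(self, func):
        # Run any DataFrame -> DataFrame function and keep the metadata.
        return AnnotatedFrame(func(self.df), dict(self.meta))


frame = AnnotatedFrame(
    pd.DataFrame({'stock': 'AAPL AAPL MSFT MSFT'.split(),
                  'price': [445., 455., 195., 205.]}),
    {'instrument_name': 'Binky'},
)

means = frame.apply(lambda d: d.groupby('stock').mean())
print(means.meta)
```

The metadata travels with the wrapper, so it survives operations such as groupby that would drop attributes set on the DataFrame itself.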

There are multiple options for storing DataFrames + metadata to files, depending on what format you wish to use -- pickle, JSON, HDF5 are all possibilities.
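For the pickle option, a minimal sketch might look like this. The function names, the bundle layout, and the temporary file path are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd


def pickle_store(path, df, **metadata):
    # Save the frame and its metadata together in one pickled dict.
    with open(path, 'wb') as f:
        pickle.dump({'data': df, 'metadata': metadata}, f)


def pickle_load(path):
    with open(path, 'rb') as f:
        bundle = pickle.load(f)
    return bundle['data'], bundle['metadata']


df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
path = os.path.join(tempfile.mkdtemp(), 'data.pkl')

pickle_store(path, df, local_tz='US/Eastern')
data, metadata = pickle_load(path)
print(metadata)
```

Keep in mind that pickle files should only be loaded from sources you trust.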

Here is how you could store and load a DataFrame with metadata using HDF5. The recipe for storing the metadata comes from the Pandas Cookbook.

import numpy as np
import pandas as pd

def h5store(filename, df, **kwargs):
    # Write the frame, then attach the metadata to its storer node.
    store = pd.HDFStore(filename)
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = kwargs
    store.close()

def h5load(store):
    # Read back both the frame and the metadata saved alongside it.
    data = store['mydata']
    metadata = store.get_storer('mydata').attrs.metadata
    return data, metadata

# pd.np was removed from pandas; call NumPy directly.
a = pd.DataFrame(
    data=np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

filename = '/tmp/data.h5'
metadata = dict(local_tz='US/Eastern')
h5store(filename, a, **metadata)
with pd.HDFStore(filename) as store:
    data, metadata = h5load(store)

print(data)
#     A   B   C   E   D
# 0   9  20  92  43  25
# 1   2  64  54   0  63
# 2  22  42   3  83  81
# 3   3  71  17  64  53
# 4  52  10  41  22  43
# 5  48  85  96  72  88
# 6  10  47   2  10  78
# 7  30  80   3  59  16
# 8  13  52  98  79  65
# 9   6  93  55  40   3

print(metadata)

yields

{'local_tz': 'US/Eastern'}

How to read and write mixed metadata-data files using pandas

Not sure why you escaped the newlines, so I removed the escapes in the sample data.

  • open file and read contents
  • take first five rows as meta header information
  • do a DF manipulation
  • save results back down to a file. Write meta data first followed by DF contents

import io
from pathlib import Path

import pandas as pd

filetext = """Source: stackoverflow.com
Citation: stackoverflow et al. 2021: How to import and export mixed metadata - data files using pandas.
Date: 17.02.21
,,,
,,,
col_1,col_2,col_3
a,0,3
b,1,9
c,4,-2"""

p = Path.cwd().joinpath("so_science.txt")
with open(p, "w") as f:
    f.write(filetext)

# get file contents
with open(p, "r") as f:
    fc = f.read()

# first five rows are metadata
header = "\n".join(fc.split("\n")[:5])
# rest is a CSV
df = pd.read_csv(io.StringIO("\n".join(fc.split("\n")[5:])))
# modify DF
df["col_2"] = df["col_2"] + df["col_3"]

# write out meta-data and CSV
with open(p, "w") as f:
    f.write(f"{header}\n")
    df.to_csv(f, index=False)


