How to Flatten a Pandas Dataframe with Some Columns as JSON

How to flatten a pandas dataframe with some columns as json?

Here's a solution using json_normalize(), with custom functions that convert each column into a format json_normalize() understands.

import ast
import pandas as pd

def only_dict(d):
    '''
    Convert the string representation of a dictionary to a Python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Convert the string representation of a list of dicts to a Python list,
    then map each dict's second value to its first value
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
A = pd.json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = pd.json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')

Finally, join the DataFrames on their common index:

df[['id', 'name']].join([A, B])


EDIT: As @MartijnPieters notes in the comments, if you know the data source is JSON, the recommended way to decode the strings is json.loads(), which is much faster than ast.literal_eval().
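As a minimal sketch of the json.loads() variant, the snippet below runs the same steps end to end. The sample rows and the "pos" and "name" keys are made up for illustration (the original data is not shown here), and list_of_dicts looks the values up by key rather than by position.

```python
import json
import pandas as pd

# Hypothetical sample data: both JSON columns hold strings, not Python objects.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["ABC", "DEF"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}',
                '{"dist": "500", "time": "0:10.05"}'],
    "columnB": ['[{"pos": "1st", "name": "X"}, {"pos": "2nd", "name": "Y"}]',
                '[{"pos": "1st", "name": "Z"}, {"pos": "2nd", "name": "W"}]'],
})

def list_of_dicts(s):
    """Decode a JSON list of records and map each "pos" to its "name"."""
    return {d["pos"]: d["name"] for d in json.loads(s)}

# Decode each column, flatten it, and prefix the new column names.
A = pd.json_normalize(df["columnA"].apply(json.loads).tolist()).add_prefix("columnA.")
B = pd.json_normalize(df["columnB"].apply(list_of_dicts).tolist()).add_prefix("columnB.pos.")

# All three frames share the default RangeIndex, so join aligns them row by row.
flat = df[["id", "name"]].join([A, B])
print(flat)
```

Looking values up by key avoids depending on the order of keys inside each record, at the cost of hard-coding the key names.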

Flatten JSON Columns in Dataframe

You can use pd.json_normalize, which should be simpler.

>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [{'contact-associated-by': ['FAKE'], 'conversi...

>>> df = df.explode('FORMSUBMISSIONS') # list to dict
>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... {'contact-associated-by': ['FAKE'], 'conversio...

Now you can call json_normalize on the FORMSUBMISSIONS column. To preserve the other columns, I use pd.concat:

>>> df = pd.concat([df, pd.json_normalize(df['FORMSUBMISSIONS'])], axis=1).drop('FORMSUBMISSIONS', axis=1)

>>> df
ID PROPERTIES contact-associated-by conversion-id form-id form-type meta-data portal-id timestamp title
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [FAKE] FAKE FAKE FAKE [] FAKE FAKE FAKE

You can do the same thing with the PROPERTIES column:

df = pd.concat([df, pd.json_normalize(df.PROPERTIES)], axis=1).drop('PROPERTIES', axis=1)
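The steps above can be sketched as one runnable script. The sample values are invented, and the reset_index call is an addition not present in the answer: when a list holds more than one element, explode repeats the index label, so resetting it keeps the rows aligned with the fresh RangeIndex that json_normalize produces from a list.

```python
import pandas as pd

# Hypothetical data: one dict column and one list-of-dicts column.
df = pd.DataFrame({
    "ID": [123],
    "PROPERTIES": [{"firstname": {"value": "FAKE"}}],
    "FORMSUBMISSIONS": [[{"form-id": "F1", "title": "T1"},
                         {"form-id": "F2", "title": "T2"}]],
})

# One row per list element; reset the index so labels are unique again.
df = df.explode("FORMSUBMISSIONS").reset_index(drop=True)

# Flatten each dict column in turn and drop the original column.
for col in ["FORMSUBMISSIONS", "PROPERTIES"]:
    expanded = pd.json_normalize(df[col].tolist())
    df = pd.concat([df, expanded], axis=1).drop(col, axis=1)

print(df)
```

Nested keys such as firstname -> value come out as dotted column names like firstname.value.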

Flatten nested JSON columns in Pandas

Take the values of each dict and use explode to turn each element into its own row; rows that come from the same original row keep the same index label. Then expand each nested dict (the values of your first dict) into columns. Finally, join your original dataframe with the new dataframe.

>>> df

stock Name Annual
0 x Tesla {'0': {'date': '2020', 'dateFormatted': '2020-...
1 y Google {'0': {'date': '2020', 'dateFormatted': '2020-...
2 z Big Apple {}

data = df['Annual'].apply(lambda x: x.values()) \
                   .explode() \
                   .apply(pd.Series)

df = df.join(data).drop(columns='Annual')

Output result:

>>> df

stock Name date dateFormatted sharesMln shares
0 x Tesla 2020 2020-12-31 3856.2405 3.856240e+09
0 x Tesla 2019 2019-12-31 3856.2405 3.856240e+09
1 y Google 2020 2020-12-31 2526.4506 2.526451e+09
1 y Google 2019 2019-12-31 2526.4506 2.526451e+09
1 y Google 2018 2018-12-31 2578.0992 2.578099e+09
2 z Big Apple NaN NaN NaN NaN
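For a reproducible version of this pattern, here is a small sketch on invented data. It is a variant rather than the exact snippet above: it builds the expanded frame with the DataFrame constructor on the non-empty rows instead of apply(pd.Series), and lets the join restore the empty rows as NaN.

```python
import pandas as pd

# Hypothetical data: a dict of dicts per row, with one empty dict.
df = pd.DataFrame({
    "stock": ["x", "z"],
    "Name": ["Tesla", "Big Apple"],
    "Annual": [
        {"0": {"date": "2020", "shares": 10}, "1": {"date": "2019", "shares": 9}},
        {},
    ],
})

# One row per inner dict; an empty list explodes to a single NaN row.
records = df["Annual"].apply(lambda d: list(d.values())).explode()

# Expand only the real dicts, keeping their (duplicated) index labels.
valid = records.dropna()
data = pd.DataFrame(valid.tolist(), index=valid.index)

# The left join on the index repeats rows with several records
# and fills NaN for rows that had none.
flat = df.drop(columns="Annual").join(data)
print(flat)
```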

Flatten JSON columns in a dataframe with lists

The idea is to use a dictionary comprehension that flattens the column row by row, keyed by the index value i, so that after concat the result can be joined back to the original DataFrame:

import json
import pandas as pd

x = '''{"sections": 
[{
"id": "12ab",
"items": [
{"id": "34cd",
"isValid": true,
"questionaire": {"title": "blah blah", "question": "Date of Purchase"}
},
{"id": "56ef",
"isValid": true,
"questionaire": {"title": "something useless", "question": "Date of Billing"}
}
]
}],
"ignore": "yes"}'''

df = pd.DataFrame({'id':['1','2'], 'name':['xyz', 'abc'],
                   'location':['new york', 'wien'], 'flatten':[x,x]})


#create default RangeIndex
df = df.reset_index(drop=True)

d = {i: pd.json_normalize(json.loads(x)['sections'],
                          'items', ['id'],
                          record_prefix='child_')[['id','child_id','child_questionaire.question']]
          .rename(columns={'child_questionaire.question':'question'})
     for i, x in df.pop('flatten').items()}

df_norm = df.rename(columns={'id':'Masterid'}).join(pd.concat(d).reset_index(level=1, drop=True))


print (df_norm)
Masterid name location id child_id question
0 1 xyz new york 12ab 34cd Date of Purchase
0 1 xyz new york 12ab 56ef Date of Billing
1 2 abc wien 12ab 34cd Date of Purchase
1 2 abc wien 12ab 56ef Date of Billing
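To see what each step of the comprehension produces, it may help to run json_normalize on a single JSON string. The sketch below uses a shortened copy of the document above and names the positional arguments explicitly (record_path and meta); the variable names raw and one_row are chosen here for illustration.

```python
import json
import pandas as pd

# Shortened version of the JSON document used in the answer.
raw = '''{"sections": [{"id": "12ab", "items": [
    {"id": "34cd", "questionaire": {"question": "Date of Purchase"}},
    {"id": "56ef", "questionaire": {"question": "Date of Billing"}}
]}]}'''

one_row = pd.json_normalize(
    json.loads(raw)["sections"],
    record_path="items",     # one output row per element of "items"
    meta=["id"],             # repeat the parent section's id on every row
    record_prefix="child_",  # keeps the item "id" apart from the section "id"
)
print(one_row)
```

The record_prefix matters here because both the section and its items have an id key; the prefix is what keeps the two columns distinct.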

