How to flatten a pandas dataframe with some columns as json?
Here's a solution that uses json_normalize() together with custom functions that convert the data into a format json_normalize() understands.
import ast
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after
    converting json strings of list to a python list
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])
A = json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')
Finally, join the DataFrames on the common index:
df[['id', 'name']].join([A, B])
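The original df is not shown in this answer, so the following is only a minimal sketch of the same steps on made-up data. The sample rows and the "pos"/"value" key names are assumptions chosen for illustration. Because the sample strings are valid JSON, they are decoded with json.loads() here.

```python
import json
import pandas as pd

# Hypothetical sample data: two columns hold JSON strings.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}',
                '{"dist": "500", "time": "0:09.20"}'],
    "columnB": ['[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}]',
                '[{"pos": "1st", "value": "400"}]'],
})

def pos_to_value(s):
    # Assumed layout: a list of {"pos": ..., "value": ...} dicts,
    # collapsed into a single {pos: value} mapping per row.
    return {d["pos"]: d["value"] for d in json.loads(s)}

# One flat frame per JSON column, prefixed to keep the column origin visible.
A = pd.json_normalize(df["columnA"].apply(json.loads).tolist()).add_prefix("columnA.")
B = pd.json_normalize(df["columnB"].apply(pos_to_value).tolist()).add_prefix("columnB.pos.")

# All three frames share the default RangeIndex, so an index join lines them up.
flat = df[["id", "name"]].join([A, B])
print(flat)
```

Rows whose list lacks a given position (the second row has no "2nd" entry) end up with NaN in that column.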
EDIT: As per the comment by @MartijnPieters, the recommended way of decoding the JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
Flatten JSON Columns in Dataframe
You can use pd.json_normalize, which should be simpler.
>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [{'contact-associated-by': ['FAKE'], 'conversi...
>>> df = df.explode('FORMSUBMISSIONS') # list to dict
>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... {'contact-associated-by': ['FAKE'], 'conversio...
Now you can call json_normalize on the FORMSUBMISSIONS column. To preserve the other columns, I use pd.concat:
>>> df = pd.concat([df, pd.json_normalize(df['FORMSUBMISSIONS'])], axis=1).drop('FORMSUBMISSIONS', axis=1)
>>> df
ID PROPERTIES contact-associated-by conversion-id form-id form-type meta-data portal-id timestamp title
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [FAKE] FAKE FAKE FAKE [] FAKE FAKE FAKE
You can do the same thing on the PROPERTIES column:
>>> df = pd.concat([df, pd.json_normalize(df.PROPERTIES)], axis=1).drop('PROPERTIES', axis=1)
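One thing to watch for: when a list holds more than one element, explode repeats the original index label, while pd.json_normalize returns a fresh RangeIndex, so a column-wise pd.concat may no longer line up. The sketch below resets the index after exploding to avoid that. The sample data is made up and much smaller than the frame shown above.

```python
import pandas as pd

# Hypothetical data: each row holds a list of dicts, one row has two entries.
df = pd.DataFrame({
    "ID": [123, 456],
    "FORMSUBMISSIONS": [
        [{"form-id": "a", "title": "first"}, {"form-id": "b", "title": "second"}],
        [{"form-id": "c", "title": "third"}],
    ],
})

# One row per list element; reset the index so it matches the RangeIndex
# that json_normalize produces.
exploded = df.explode("FORMSUBMISSIONS").reset_index(drop=True)

# Expand each dict into its own columns.
normalized = pd.json_normalize(exploded["FORMSUBMISSIONS"].tolist())

# Both frames now share the same index, so a column-wise concat is safe.
flat = pd.concat([exploded.drop(columns="FORMSUBMISSIONS"), normalized], axis=1)
print(flat)
```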
Flatten nested JSON columns in Pandas
Get the values from each dict and turn each element of the resulting list into its own row with explode, which duplicates the index for rows that come from the same dict. Then expand the nested dicts (the values of the outer dict) into columns. Finally, join the original dataframe with the new dataframe.
>>> df
stock Name Annual
0 x Tesla {'0': {'date': '2020', 'dateFormatted': '2020-...
1 y Google {'0': {'date': '2020', 'dateFormatted': '2020-...
2 z Big Apple {}
data = df['Annual'].apply(lambda x: x.values()) \
                   .explode() \
                   .apply(pd.Series)
df = df.join(data).drop(columns='Annual')
Output result:
>>> df
  stock       Name  date dateFormatted  sharesMln        shares
0     x      Tesla  2020    2020-12-31  3856.2405  3.856240e+09
0     x      Tesla  2019    2019-12-31  3856.2405  3.856240e+09
1     y     Google  2020    2020-12-31  2526.4506  2.526451e+09
1     y     Google  2019    2019-12-31  2526.4506  2.526451e+09
1     y     Google  2018    2018-12-31  2578.0992  2.578099e+09
2     z  Big Apple   NaN           NaN        NaN           NaN
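A variant of the same idea, as a sketch on made-up data: instead of apply(pd.Series), the exploded records are passed to the DataFrame constructor in one call, with empty entries replaced by empty dicts so they become all-NaN rows. The sample values and the simplified column set are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 'Annual' maps arbitrary keys to one record per year.
df = pd.DataFrame({
    "stock": ["x", "y", "z"],
    "Name": ["Tesla", "Google", "Big Apple"],
    "Annual": [
        {"0": {"date": "2020", "shares": 10}, "1": {"date": "2019", "shares": 9}},
        {"0": {"date": "2020", "shares": 5}},
        {},
    ],
})

# Keep only the inner records and give each one its own row.
# An empty dict becomes an empty list, which explode turns into NaN.
exploded = df["Annual"].apply(lambda d: list(d.values())).explode()

# Replace the NaN placeholders with empty dicts so the constructor
# produces an all-NaN row for them.
records = [rec if isinstance(rec, dict) else {} for rec in exploded]
data = pd.DataFrame(records, index=exploded.index)

# The duplicated index labels in 'data' repeat the matching left-hand rows.
flat = df.drop(columns="Annual").join(data)
print(flat)
```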
Flatten JSON columns in a dataframe with lists
The idea is to use a dictionary comprehension over the flatten column, with i as the index values, so that after concat the result can be joined to the original DataFrame:
import json
import pandas as pd

x = '''{"sections":
[{
"id": "12ab",
"items": [
{"id": "34cd",
"isValid": true,
"questionaire": {"title": "blah blah", "question": "Date of Purchase"}
},
{"id": "56ef",
"isValid": true,
"questionaire": {"title": "something useless", "question": "Date of Billing"}
}
]
}],
"ignore": "yes"}'''
df = pd.DataFrame({'id':['1','2'], 'name':['xyz', 'abc'],
                   'location':['new york', 'wien'], 'flatten':[x,x]})
#create default RangeIndex
df = df.reset_index(drop=True)
d = {i: pd.json_normalize(json.loads(x)['sections'],
                          'items', ['id'],
                          record_prefix='child_')[['id','child_id','child_questionaire.question']]
          .rename(columns={'child_questionaire.question':'question'})
     for i, x in df.pop('flatten').items()}
df_norm = df.rename(columns={'id':'Masterid'}).join(pd.concat(d).reset_index(level=1, drop=True))
print (df_norm)
  Masterid name  location    id child_id          question
0        1  xyz  new york  12ab     34cd  Date of Purchase
0        1  xyz  new york  12ab     56ef  Date of Billing
1        2  abc      wien  12ab     34cd  Date of Purchase
1        2  abc      wien  12ab     56ef  Date of Billing
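To see what the positional arguments 'items' and ['id'] do in the json_normalize call, here is a small sketch with the keyword names record_path and meta spelled out. The JSON string is a made-up, reduced version of the one above.

```python
import json
import pandas as pd

# Hypothetical, reduced payload: one section with two nested items.
raw = '''{"sections": [
    {"id": "s1",
     "items": [
        {"id": "a", "q": {"text": "first"}},
        {"id": "b", "q": {"text": "second"}}
     ]}
]}'''

flat = pd.json_normalize(
    json.loads(raw)["sections"],
    record_path="items",     # each element of "items" becomes one row
    meta=["id"],             # the parent section's "id" is repeated on every row
    record_prefix="child_",  # avoids a name clash between the item id and the section id
)
print(flat)
```

Nested dicts inside each record are flattened with a dot separator, which is why the column selected in the answer above is named child_questionaire.question.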