How to flatten a nested JSON recursively, with flatten_json
How to flatten a JSON or dict is a common question, to which there are many answers.
This answer focuses on using flatten_json to recursively flatten a nested dict or JSON.
Assumptions:
- This answer assumes you already have the JSON or dict loaded into some variable (e.g. from a file or an API). In this case we will use data.
How data is loaded into flatten_json:
- It accepts a dict, as shown by the function's type hint.
The most common forms of data:
- Just a dict, {}: flatten_json(data)
- A list of dicts, [{}, {}, {}]: [flatten_json(x) for x in data]
- JSON with top-level keys, where the values repeat, {1: {}, 2: {}, 3: {}}: [flatten_json(data[key]) for key in data]
- Other, {'key': [{}, {}, {}]}: [flatten_json(x) for x in data['key']]
Practical Examples:
- I typically flatten data into a pandas.DataFrame for further analysis.
- Load pandas with import pandas as pd.
- flatten_json returns a dict, which can also be saved directly using the csv module.
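If you'd rather not install the flatten_json package, its core behavior can be sketched in a few lines of recursion. This is a minimal sketch, not the real library: the actual package also supports custom separators and options such as root_keys_to_ignore.

```python
def flatten_json(nested, sep='_'):
    """Minimal sketch of flatten_json: recursively flattens nested dicts
    and lists into one flat dict, joining compound keys with `sep`.
    Note that empty dicts/lists contribute no keys, which matches the
    outputs shown in the examples below."""
    out = {}

    def _flatten(obj, prefix=''):
        if isinstance(obj, dict):
            for key, value in obj.items():
                _flatten(value, f'{prefix}{key}{sep}')
        elif isinstance(obj, list):
            for i, value in enumerate(obj):
                _flatten(value, f'{prefix}{i}{sep}')
        else:
            out[prefix[:-len(sep)]] = obj  # drop the trailing separator

    _flatten(nested)
    return out

print(flatten_json({'a': {'b': 1}, 'c': [10, 20]}))
# {'a_b': 1, 'c_0': 10, 'c_1': 20}
```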
Data 1:
{
"id": 1,
"class": "c1",
"owner": "myself",
"metadata": {
"m1": {
"value": "m1_1",
"timestamp": "d1"
},
"m2": {
"value": "m1_2",
"timestamp": "d2"
},
"m3": {
"value": "m1_3",
"timestamp": "d3"
},
"m4": {
"value": "m1_4",
"timestamp": "d4"
}
},
"a1": {
"a11": [
]
},
"m1": {},
"comm1": "COMM1",
"comm2": "COMM21529089656387",
"share": "xxx",
"share1": "yyy",
"hub1": "h1",
"hub2": "h2",
"context": [
]
}
Flatten 1:
df = pd.DataFrame([flatten_json(data)])
id class owner metadata_m1_value metadata_m1_timestamp metadata_m2_value metadata_m2_timestamp metadata_m3_value metadata_m3_timestamp metadata_m4_value metadata_m4_timestamp comm1 comm2 share share1 hub1 hub2
1 c1 myself m1_1 d1 m1_2 d2 m1_3 d3 m1_4 d4 COMM1 COMM21529089656387 xxx yyy h1 h2
Data 2:
[{
'accuracy': 17,
'activity': [{
'activity': [{
'confidence': 100,
'type': 'STILL'
}
],
'timestampMs': '1542652'
}
],
'altitude': -10,
'latitudeE7': 3777321,
'longitudeE7': -122423125,
'timestampMs': '1542654',
'verticalAccuracy': 2
}, {
'accuracy': 17,
'activity': [{
'activity': [{
'confidence': 100,
'type': 'STILL'
}
],
'timestampMs': '1542652'
}
],
'altitude': -10,
'latitudeE7': 3777321,
'longitudeE7': -122423125,
'timestampMs': '1542654',
'verticalAccuracy': 2
}, {
'accuracy': 17,
'activity': [{
'activity': [{
'confidence': 100,
'type': 'STILL'
}
],
'timestampMs': '1542652'
}
],
'altitude': -10,
'latitudeE7': 3777321,
'longitudeE7': -122423125,
'timestampMs': '1542654',
'verticalAccuracy': 2
}
]
Flatten 2:
df = pd.DataFrame([flatten_json(x) for x in data])
accuracy activity_0_activity_0_confidence activity_0_activity_0_type activity_0_timestampMs altitude latitudeE7 longitudeE7 timestampMs verticalAccuracy
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2
Data 3:
{
"1": {
"VENUE": "JOEBURG",
"COUNTRY": "HAE",
"ITW": "XAD",
"RACES": {
"1": {
"NO": 1,
"TIME": "12:35"
},
"2": {
"NO": 2,
"TIME": "13:10"
},
"3": {
"NO": 3,
"TIME": "13:40"
},
"4": {
"NO": 4,
"TIME": "14:10"
},
"5": {
"NO": 5,
"TIME": "14:55"
},
"6": {
"NO": 6,
"TIME": "15:30"
},
"7": {
"NO": 7,
"TIME": "16:05"
},
"8": {
"NO": 8,
"TIME": "16:40"
}
}
},
"2": {
"VENUE": "FOOBURG",
"COUNTRY": "ABA",
"ITW": "XAD",
"RACES": {
"1": {
"NO": 1,
"TIME": "12:35"
},
"2": {
"NO": 2,
"TIME": "13:10"
},
"3": {
"NO": 3,
"TIME": "13:40"
},
"4": {
"NO": 4,
"TIME": "14:10"
},
"5": {
"NO": 5,
"TIME": "14:55"
},
"6": {
"NO": 6,
"TIME": "15:30"
},
"7": {
"NO": 7,
"TIME": "16:05"
},
"8": {
"NO": 8,
"TIME": "16:40"
}
}
}
}
Flatten 3:
df = pd.DataFrame([flatten_json(data[key]) for key in data])
VENUE COUNTRY ITW RACES_1_NO RACES_1_TIME RACES_2_NO RACES_2_TIME RACES_3_NO RACES_3_TIME RACES_4_NO RACES_4_TIME RACES_5_NO RACES_5_TIME RACES_6_NO RACES_6_TIME RACES_7_NO RACES_7_TIME RACES_8_NO RACES_8_TIME
JOEBURG HAE XAD 1 12:35 2 13:10 3 13:40 4 14:10 5 14:55 6 15:30 7 16:05 8 16:40
FOOBURG ABA XAD 1 12:35 2 13:10 3 13:40 4 14:10 5 14:55 6 15:30 7 16:05 8 16:40
Other Examples:
- Python Pandas - Flatten Nested JSON
- handling nested json in pandas
- How to flatten a nested JSON from the NASA Weather Insight API in Python
Flatten a nested JSON?
flatten_json is now available as a library, so you can simply use it. In this case it gives you 160 columns:
from flatten_json import flatten
dic_flattened = (flatten(d, '.') for d in test_json['result'])
df = pd.DataFrame(dic_flattened)
df.shape
(5, 160)
How to flatten multilevel/nested JSON?
I used the following function (details can be found here):
def flatten_data(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
This unfortunately flattens the whole JSON: if you have a multi-level JSON (many nested dictionaries), it might flatten everything into a single row with a huge number of columns.
What I used, in the end, was json_normalize(), specifying the structure that I required. A nice example of how to do it that way can be found here.
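As a sketch of that approach (the data and field names here are made up for illustration), pd.json_normalize lets you name exactly which record path to expand and which metadata to keep, instead of flattening everything:

```python
import pandas as pd

# Hypothetical data: expand only the 'races' records, keep 'venue' as metadata.
data = [
    {'venue': 'JOEBURG', 'races': [{'no': 1, 'time': '12:35'}, {'no': 2, 'time': '13:10'}]},
    {'venue': 'FOOBURG', 'races': [{'no': 1, 'time': '12:35'}]},
]

# One output row per race; 'venue' is repeated alongside each record.
df = pd.json_normalize(data, record_path='races', meta=['venue'])
print(df)
#    no   time    venue
# 0   1  12:35  JOEBURG
# 1   2  13:10  JOEBURG
# 2   1  12:35  FOOBURG
```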
flatten_json recursive flattening function for lists
I solved it using recursion; here's my code:
import json
import pandas as pd
import flatten_json as fj
keys = {'data', 'level1', 'level2', 'level3'}
with open('test_lh.json') as f:
data = json.load(f)
levels = ['data.level1.level2.level3', 'data.level1.level2', 'data.level1', 'data']
recs_dict = {}
def do_step(data_dict, level, depth, path):
recs = []
for x in data_dict[level]:
if depth < len(path.split('.'))-1:
do_step(x, path.split('.')[depth+1], depth+1, path)
else:
dic = fj.flatten(x, root_keys_to_ignore=keys)
recs.append(dic)
recs_dict[level] = recs
for path in levels:
do_step(data, path.split('.')[0], 0, path)
for key, value in recs_dict.items():
print(key)
df = pd.DataFrame(recs_dict[key])
print(df)
And here's the output:
level3
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 abs level3
1 abc def 123 abc def 123 abs level3
level2
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 abs level2
1 abc def 123 abc def 123 abs abd
level1
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 asd level1
data
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 qwer abd
Flatten Nested JSON in Python
The error you got indicates that some of your values are actually dictionaries within an array.
Assuming you want to flatten your JSON file to retrieve the keys mediaType, queueId, and count, these can be retrieved with the following sample code:
import json
with open(path_to_json_file, 'r') as f:
json_dict = json.load(f)
for result in json_dict.get("results"):
media_type = result.get("group").get("mediaType")
queue_id = result.get("group").get("queueId")
n_offered = result.get("data")[0].get("metrics")[0].get("count")
If your data and metrics keys have multiple indices, you will have to use a for loop to retrieve every count value accordingly.
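That nested loop might look like this (a sketch, assuming the same results/group/data/metrics structure as above; the sample values are made up):

```python
# Hypothetical data in the structure the answer describes.
json_dict = {
    'results': [
        {'group': {'mediaType': 'voice', 'queueId': 'q1'},
         'data': [{'metrics': [{'metric': 'nOffered', 'count': 5},
                               {'metric': 'nAnswered', 'count': 3}]}]},
    ],
}

counts = []
for result in json_dict.get('results', []):
    media_type = result.get('group', {}).get('mediaType')
    queue_id = result.get('group', {}).get('queueId')
    # Loop over every data entry and every metric, not just index 0.
    for datum in result.get('data', []):
        for metric in datum.get('metrics', []):
            counts.append((media_type, queue_id, metric.get('count')))

print(counts)
# [('voice', 'q1', 5), ('voice', 'q1', 3)]
```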
Flatten a triple-nested JSON into a dataframe
It is similar to what you have in your Edit, but with slightly shorter syntax and better performance.
If you have NaN in the DataFrame, older versions of Pandas could fail on json_normalize.
This solution should work with Pandas 1.3+.
df = pd.json_normalize(products)
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)
Performance with 1000 products:
Code in Edit: 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
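The explode-then-normalize step used above can be demonstrated on a toy column (the 'items' field here is made up for illustration):

```python
import pandas as pd

# Hypothetical data: each row carries a list of dicts in 'items'.
df = pd.DataFrame({'name': ['p1', 'p2'],
                   'items': [[{'size': 'S'}, {'size': 'M'}], [{'size': 'L'}]]})

# One row per list element, then expand each dict into prefixed columns.
df = df.explode('items')
df = pd.concat([df.drop('items', axis=1).reset_index(drop=True),
                pd.json_normalize(df['items']).add_prefix('items.')], axis=1)
print(df)
#   name items.size
# 0   p1          S
# 1   p1          M
# 2   p2          L
```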
Fastest way to flatten / un-flatten nested JavaScript objects
Here's my much shorter implementation:
Object.unflatten = function(data) {
"use strict";
if (Object(data) !== data || Array.isArray(data))
return data;
var regex = /\.?([^.\[\]]+)|\[(\d+)\]/g,
resultholder = {};
for (var p in data) {
var cur = resultholder,
prop = "",
m;
while (m = regex.exec(p)) {
cur = cur[prop] || (cur[prop] = (m[2] ? [] : {}));
prop = m[2] || m[1];
}
cur[prop] = data[p];
}
return resultholder[""] || resultholder;
};
flatten
hasn't changed much (and I'm not sure whether you really need those isEmpty
cases):
Object.flatten = function(data) {
var result = {};
function recurse (cur, prop) {
if (Object(cur) !== cur) {
result[prop] = cur;
} else if (Array.isArray(cur)) {
for(var i=0, l=cur.length; i<l; i++)
recurse(cur[i], prop + "[" + i + "]");
if (l == 0)
result[prop] = [];
} else {
var isEmpty = true;
for (var p in cur) {
isEmpty = false;
recurse(cur[p], prop ? prop+"."+p : p);
}
if (isEmpty && prop)
result[prop] = {};
}
}
recurse(data, "");
return result;
}
Together, they run your benchmark in about the half of the time (Opera 12.16: ~900ms instead of ~ 1900ms, Chrome 29: ~800ms instead of ~1600ms).
Note: This and most other solutions answered here focus on speed, are susceptible to prototype pollution, and should not be used on untrusted objects.