How to Flatten a Nested JSON Recursively with flatten_json


How to flatten a JSON or dict is a common question, to which there are many answers.

  • This answer focuses on using flatten_json to recursively flatten a nested dict or JSON.

Assumptions:

  • This answer assumes you already have the JSON or dict loaded into some variable, e.g. from a file or an API (see the sketch below).
    • In this case we will use data.
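
For reference, a minimal sketch of that loading step with the standard json module (the file name data.json is just a placeholder for your own source):

import json

# Hypothetical file name; replace with your own file or API response
with open('data.json') as f:
    data = json.load(f)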

How data is loaded into flatten_json:

  • It accepts a dict, as shown by the function type hint.

The most common forms of data:

  • Just a dict: {}
    • flatten_json(data)
  • List of dicts: [{}, {}, {}]
    • [flatten_json(x) for x in data]
  • JSON with top-level keys, where the values repeat: {1: {}, 2: {}, 3: {}}
    • [flatten_json(data[key]) for key in data]
  • Other
    • {'key': [{}, {}, {}]}: [flatten_json(x) for x in data['key']]

Practical Examples:

  • I typically flatten data into a pandas.DataFrame for further analysis.
    • Load pandas with import pandas as pd
  • flatten_json returns a dict, which can be saved directly with the csv module (see the sketch below).
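
A minimal sketch of saving one flattened record with csv.DictWriter, using the library's flatten (the same function imported further down this page); the output file name is a placeholder:

import csv
from flatten_json import flatten  # the library's version of the flatten_json helper used above

# 'data' is assumed to be a single nested dict, as in Data 1 below
flat = flatten(data)

with open('flat.csv', 'w', newline='') as f:  # hypothetical output file
    writer = csv.DictWriter(f, fieldnames=flat.keys())
    writer.writeheader()
    writer.writerow(flat)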


Data 1:

{
    "id": 1,
    "class": "c1",
    "owner": "myself",
    "metadata": {
        "m1": {
            "value": "m1_1",
            "timestamp": "d1"
        },
        "m2": {
            "value": "m1_2",
            "timestamp": "d2"
        },
        "m3": {
            "value": "m1_3",
            "timestamp": "d3"
        },
        "m4": {
            "value": "m1_4",
            "timestamp": "d4"
        }
    },
    "a1": {
        "a11": []
    },
    "m1": {},
    "comm1": "COMM1",
    "comm2": "COMM21529089656387",
    "share": "xxx",
    "share1": "yyy",
    "hub1": "h1",
    "hub2": "h2",
    "context": []
}

Flatten 1:

df = pd.DataFrame([flatten_json(data)])

id class owner metadata_m1_value metadata_m1_timestamp metadata_m2_value metadata_m2_timestamp metadata_m3_value metadata_m3_timestamp metadata_m4_value metadata_m4_timestamp comm1 comm2 share share1 hub1 hub2
1 c1 myself m1_1 d1 m1_2 d2 m1_3 d3 m1_4 d4 COMM1 COMM21529089656387 xxx yyy h1 h2


Data 2:

[{
    'accuracy': 17,
    'activity': [{
        'activity': [{
            'confidence': 100,
            'type': 'STILL'
        }],
        'timestampMs': '1542652'
    }],
    'altitude': -10,
    'latitudeE7': 3777321,
    'longitudeE7': -122423125,
    'timestampMs': '1542654',
    'verticalAccuracy': 2
}, {
    'accuracy': 17,
    'activity': [{
        'activity': [{
            'confidence': 100,
            'type': 'STILL'
        }],
        'timestampMs': '1542652'
    }],
    'altitude': -10,
    'latitudeE7': 3777321,
    'longitudeE7': -122423125,
    'timestampMs': '1542654',
    'verticalAccuracy': 2
}, {
    'accuracy': 17,
    'activity': [{
        'activity': [{
            'confidence': 100,
            'type': 'STILL'
        }],
        'timestampMs': '1542652'
    }],
    'altitude': -10,
    'latitudeE7': 3777321,
    'longitudeE7': -122423125,
    'timestampMs': '1542654',
    'verticalAccuracy': 2
}]

Flatten 2:

df = pd.DataFrame([flatten_json(x) for x in data])

accuracy activity_0_activity_0_confidence activity_0_activity_0_type activity_0_timestampMs altitude latitudeE7 longitudeE7 timestampMs verticalAccuracy
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2
17 100 STILL 1542652 -10 3777321 -122423125 1542654 2


Data 3:

{
    "1": {
        "VENUE": "JOEBURG",
        "COUNTRY": "HAE",
        "ITW": "XAD",
        "RACES": {
            "1": {"NO": 1, "TIME": "12:35"},
            "2": {"NO": 2, "TIME": "13:10"},
            "3": {"NO": 3, "TIME": "13:40"},
            "4": {"NO": 4, "TIME": "14:10"},
            "5": {"NO": 5, "TIME": "14:55"},
            "6": {"NO": 6, "TIME": "15:30"},
            "7": {"NO": 7, "TIME": "16:05"},
            "8": {"NO": 8, "TIME": "16:40"}
        }
    },
    "2": {
        "VENUE": "FOOBURG",
        "COUNTRY": "ABA",
        "ITW": "XAD",
        "RACES": {
            "1": {"NO": 1, "TIME": "12:35"},
            "2": {"NO": 2, "TIME": "13:10"},
            "3": {"NO": 3, "TIME": "13:40"},
            "4": {"NO": 4, "TIME": "14:10"},
            "5": {"NO": 5, "TIME": "14:55"},
            "6": {"NO": 6, "TIME": "15:30"},
            "7": {"NO": 7, "TIME": "16:05"},
            "8": {"NO": 8, "TIME": "16:40"}
        }
    }
}

Flatten 3:

df = pd.DataFrame([flatten_json(data[key]) for key in data])

VENUE COUNTRY ITW RACES_1_NO RACES_1_TIME RACES_2_NO RACES_2_TIME RACES_3_NO RACES_3_TIME RACES_4_NO RACES_4_TIME RACES_5_NO RACES_5_TIME RACES_6_NO RACES_6_TIME RACES_7_NO RACES_7_TIME RACES_8_NO RACES_8_TIME
JOEBURG HAE XAD 1 12:35 2 13:10 3 13:40 4 14:10 5 14:55 6 15:30 7 16:05 8 16:40
FOOBURG ABA XAD 1 12:35 2 13:10 3 13:40 4 14:10 5 14:55 6 15:30 7 16:05 8 16:40


Other Examples:

  1. Python Pandas - Flatten Nested JSON
  2. handling nested json in pandas
  3. How to flatten a nested JSON from the NASA Weather Insight API in Python

Flatten a nested JSON?

flatten_json is available as a library now, so you can do this. It'll give you 160 columns:

import pandas as pd
from flatten_json import flatten

dic_flattened = (flatten(d, '.') for d in test_json['result'])
df = pd.DataFrame(dic_flattened)

df.shape
(5, 160)

How to flatten multilevel/nested JSON?

I used the following function (details can be found here):

def flatten_data(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

This unfortunately completely flattens the whole JSON, meaning that if you have a multi-level JSON (many nested dictionaries), it might flatten everything into a single row with a huge number of columns.

In the end, I used json_normalize() and specified the structure I required. A nice example of how to do it that way can be found here.
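
As a short, hedged sketch of that json_normalize approach (the keys id, results, group, and value below are made up purely for illustration):

import pandas as pd

# Hypothetical nested structure, only to illustrate record_path / meta
data = {
    "id": 1,
    "results": [
        {"group": "a", "value": 10},
        {"group": "b", "value": 20},
    ],
}

# Flatten only the records under 'results' and keep 'id' as a regular column
df = pd.json_normalize(data, record_path=["results"], meta=["id"])
#   group  value  id
# 0     a     10   1
# 1     b     20   1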

flatten_json recursive flattening function for lists

I solved it using recursion, here's my code:

import json
import pandas as pd
import flatten_json as fj

keys = {'data', 'level1', 'level2', 'level3'}
with open('test_lh.json') as f:
    data = json.load(f)

levels = ['data.level1.level2.level3', 'data.level1.level2', 'data.level1', 'data']
recs_dict = {}

def do_step(data_dict, level, depth, path):
    # Walk down the dotted path; flatten the records once the last level is reached
    recs = []
    for x in data_dict[level]:
        if depth < len(path.split('.')) - 1:
            do_step(x, path.split('.')[depth + 1], depth + 1, path)
        else:
            dic = fj.flatten(x, root_keys_to_ignore=keys)
            recs.append(dic)
    recs_dict[level] = recs

# One pass per path, starting at the top-level 'data' key each time
for path in levels:
    do_step(data, path.split('.')[0], 0, path)

# One DataFrame per flattened level
for key, value in recs_dict.items():
    print(key)
    df = pd.DataFrame(recs_dict[key])
    print(df)

And here's the output:

level3
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 abs level3
1 abc def 123 abc def 123 abs level3
level2
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 abs level2
1 abc def 123 abc def 123 abs abd
level1
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 asd level1
data
identifiers_0_type identifiers_0_scheme identifiers_0_value identifiers_1_type identifiers_1_scheme identifiers_1_value name type
0 abc def 123 abc def 123 qwer abd

Flatten Nested JSON in Python

The error you got indicates you missed that some of your values are actually a dictionary within an array.

Assuming you want to flatten your JSON file to retrieve the following keys: mediaType, queueId, and count.

These can be retrieved with the following sample code:

import json

with open(path_to_json_file, 'r') as f:
    json_dict = json.load(f)

for result in json_dict.get("results"):
    media_type = result.get("group").get("mediaType")
    queue_id = result.get("group").get("queueId")
    n_offered = result.get("data")[0].get("metrics")[0].get("count")

If your data and metrics keys have multiple entries, you will have to use a for loop to retrieve every count value accordingly (see the sketch below).
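
A sketch of that loop, assuming the same results / group / data / metrics structure as in the snippet above:

rows = []
for result in json_dict.get("results"):
    group = result.get("group")
    # iterate over every data entry and every metric instead of only index 0
    for data_entry in result.get("data"):
        for metric in data_entry.get("metrics"):
            rows.append({
                "mediaType": group.get("mediaType"),
                "queueId": group.get("queueId"),
                "count": metric.get("count"),
            })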

Flatten a triple-nested JSON into a DataFrame

It is similar to what you have in your Edit, but with slightly shorter syntax and better performance.

If you have NaN in the DataFrame, older versions of Pandas could fail on json_normalize.

This solution should work with Pandas 1.3+.

df = pd.json_normalize(products)
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
                pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
                pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)

Performance with 1000 products:

Code in Edit: 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastest way to flatten / un-flatten nested JavaScript objects

Here's my much shorter implementation:

Object.unflatten = function(data) {
    "use strict";
    if (Object(data) !== data || Array.isArray(data))
        return data;
    var regex = /\.?([^.\[\]]+)|\[(\d+)\]/g,
        resultholder = {};
    for (var p in data) {
        var cur = resultholder,
            prop = "",
            m;
        while (m = regex.exec(p)) {
            cur = cur[prop] || (cur[prop] = (m[2] ? [] : {}));
            prop = m[2] || m[1];
        }
        cur[prop] = data[p];
    }
    return resultholder[""] || resultholder;
};

flatten hasn't changed much (and I'm not sure whether you really need those isEmpty cases):

Object.flatten = function(data) {
    var result = {};
    function recurse(cur, prop) {
        if (Object(cur) !== cur) {
            result[prop] = cur;
        } else if (Array.isArray(cur)) {
            for (var i = 0, l = cur.length; i < l; i++)
                recurse(cur[i], prop + "[" + i + "]");
            if (l == 0)
                result[prop] = [];
        } else {
            var isEmpty = true;
            for (var p in cur) {
                isEmpty = false;
                recurse(cur[p], prop ? prop + "." + p : p);
            }
            if (isEmpty && prop)
                result[prop] = {};
        }
    }
    recurse(data, "");
    return result;
}

Together, they run your benchmark in about half the time (Opera 12.16: ~900 ms instead of ~1900 ms; Chrome 29: ~800 ms instead of ~1600 ms).

Note: This and most of the other solutions here focus on speed; they are susceptible to prototype pollution and should not be used on untrusted objects.


