Pandas Read Nested JSON


You can use json_normalize:

import json
import pandas as pd

with open('myJson.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
                       record_prefix='locations_')
print(df)
  locations_arrTime locations_arrTimeDiffMin locations_depTime  \
0                                                         06:32
1             06:37                        1             06:40
2             08:24                        1

  locations_depTimeDiffMin           locations_name locations_platform  \
0                        0  Spital am Pyhrn Bahnhof                  2
1                        0  Windischgarsten Bahnhof                  2
2                                   Linz/Donau Hbf                1A-B

   locations_stationIdx locations_track  number    name        date
0                     0                  R 3932  R 3932  01.10.2016
1                     1                  R 3932  R 3932  01.10.2016
2                    22                  R 3932  R 3932  01.10.2016

EDIT:

Alternatively, you can use read_json, extract each location's name via the DataFrame constructor, and finally groupby with apply and join:

df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date','name','number'])['locations'].apply(','.join).reset_index()
print (df)
date name number locations
0 2016-01-10 R 3932 Spital am Pyhrn Bahnhof,Windischgarsten Bahnho...
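The read_json approach can be sketched end-to-end on tiny inline data (a hypothetical stand-in for the contents of myJson.json):

```python
import pandas as pd

# Minimal sketch: a DataFrame shaped like pd.read_json("myJson.json") would
# produce, with one dict per row in the 'locations' column.
df = pd.DataFrame({
    'date': ['01.10.2016', '01.10.2016'],
    'name': ['R 3932', 'R 3932'],
    'number': ['R 3932', 'R 3932'],
    'locations': [{'name': 'Spital am Pyhrn Bahnhof'},
                  {'name': 'Windischgarsten Bahnhof'}],
})

# Replace each dict with just its 'name', then join the names per group.
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date', 'name', 'number'])['locations'].apply(','.join).reset_index()
print(df.loc[0, 'locations'])
# Spital am Pyhrn Bahnhof,Windischgarsten Bahnhof
```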

Read nested JSON into Pandas DataFrame

I suggest using pd.json_normalize() ( https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html ), which transforms nested JSON data into a flat pandas DataFrame.

Note 1: In the following I assume the data is available in a Python dictionary called data. For testing purposes I used

import json

json_data = '''
{
    "meta": {
        # ....
    },
    #...
    "included": []
}
'''
data = json.loads(json_data)

where json_data is your JSON response. As json.loads() doesn't accept a trailing comma, I omitted the comma after the children object.
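The trailing-comma restriction is easy to verify in isolation:

```python
import json

# json.loads() is strict JSON: a trailing comma raises JSONDecodeError.
try:
    json.loads('{"a": 1,}')
    valid = True
except json.JSONDecodeError:
    valid = False

print(valid)  # False
```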

pd.json_normalize() offers different options. One possibility is to simply read all "children" data and then drop the columns that are not required. Also, after normalizing, some columns have a prefix "columns." that needs to be removed.

import pandas as pd

df = pd.json_normalize(data['data']['attributes']['total']['children'])
df.drop(columns=['grouping', 'entity_id'], inplace=True)
# Use regex=False so the '.' is matched literally:
df.columns = df.columns.str.replace('columns.', '', regex=False)

Finally, the columns names need to be replaced with those in the "columns" data:

column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
df.rename(columns=column_name_mapper, inplace=True)

Note 2: There are some slight deviations from the expected structure you described. Most notably the word 'name' (with the row value "Apple Holdings Adv (748374923)") in the data frame header is not changed to 'Holding Account' as both terms are not found in the columns list. Some other values simply differ between the described JSON response and the expected structure.
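The whole pipeline (normalize, drop, strip prefix, rename) can be demonstrated on a minimal, made-up structure that mirrors the one described above (all keys and names here are hypothetical):

```python
import pandas as pd

# Hypothetical data mirroring the described response shape.
data = {
    'meta': {'columns': [{'key': 'value', 'display_name': 'Market Value'}]},
    'data': {'attributes': {'total': {'children': [
        {'grouping': 'holding', 'entity_id': 1,
         'columns': {'value': 100.0}},
    ]}}},
}

df = pd.json_normalize(data['data']['attributes']['total']['children'])
df = df.drop(columns=['grouping', 'entity_id'])
df.columns = df.columns.str.replace('columns.', '', regex=False)

# Map internal keys to display names.
mapper = {c['key']: c['display_name'] for c in data['meta']['columns']}
df = df.rename(columns=mapper)
print(list(df.columns))  # ['Market Value']
```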

How to load a nested json file into a pandas DataFrame

I think you can use json_normalize to load them into pandas.

test.json in this case is your full json file (with double quotes).


import json
import pandas as pd

with open('path_to_json.json') as f:
    data = json.load(f)
df = pd.json_normalize(data, record_path=['features'], meta=['name'])

print(df)

This results in a dataframe with one row per feature.

You can further add record fields in the normalize call to create more columns, e.g. for the polygon coordinates.
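For illustration, here is how json_normalize handles a minimal GeoJSON-like structure (the data below is made up; a real file will have more features and properties):

```python
import pandas as pd

# Hypothetical minimal GeoJSON-like input.
data = {
    "name": "demo",
    "features": [
        {"type": "Feature",
         "properties": {"id": 1},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]]}},
    ],
}

# Nested dicts inside each record are flattened with '.' separators,
# e.g. 'properties.id' and 'geometry.coordinates'.
gdf = pd.json_normalize(data, record_path=['features'], meta=['name'])
print(gdf['geometry.coordinates'].iloc[0])
```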

You can find more documentation at https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html

Hope that helps.

Flattening deeply nested JSON into pandas data frame

I think the issue here is that your data is double nested... there is a key referenced_tweets within referenced_tweets.

import json
from pandas import json_normalize

with open("flatten.json", "r") as file:
    data = json.load(file)

df = json_normalize(
    data,
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=[
        "author_id",

        # ["author", "username"],  # not possible
        # "author",                # possible but not useful

        ["referenced_tweets", "id"],
        ["referenced_tweets", "type"],
        ["referenced_tweets", "in_reply_to_user_id"],
        ["referenced_tweets", "in_reply_to_user", "username"],
    ],
)

print(df)


See also: https://stackoverflow.com/a/37668569/42659


Note: Above code will fail if second nested referenced_tweet is missing.


Edit: Alternatively, you could further normalize the data (which your code in the question already partly normalizes) with an additional manual iteration. See the example below. Note: the code is not optimized and may be slow, depending on the amount of data.

import pandas as pd
from pandas import json_normalize

# load your `data` with `json.load()` or `json.loads()`

df = json_normalize(
    data,
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix=".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append(row, type, id):
    normalized_row = [*row.to_list(), type, id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    # a list/array is expected
    if isinstance(row["referenced_tweets"], list):
        for tweet in row["referenced_tweets"]:
            append(row, tweet["type"], tweet["id"])
        # if the list is empty
        if not row["referenced_tweets"]:
            append(row, None, None)
    else:
        # NaN (float) when the API omitted the key entirely
        append(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df = enhanced_df.drop(columns="referenced_tweets")

print(enhanced_df)



Edit 2: referenced_tweets should be an array. However, if there is no referenced tweet, the Twitter API seems to omit referenced_tweets completely. In that case, the cell value is NaN (float) instead of an empty list. I updated the code above to take that into account.
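The NaN-vs-list distinction described above can be seen in a minimal sketch (the series below is a made-up stand-in for the referenced_tweets column):

```python
import pandas as pd

# When the API omits referenced_tweets, pandas stores NaN (a float),
# not an empty list, so a type check is needed before iterating.
s = pd.Series([[{"type": "retweeted", "id": "1"}], float("nan")])

safe = [x if isinstance(x, list) else [] for x in s]
print([len(x) for x in safe])  # [1, 0]
```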

Normalizing nested JSON object into Pandas dataframe

Personally, I would not use pd.json_normalize for this case. Your JSON is quite complex, and unless you're really experienced with json_normalize, the following code may take less time to understand for the average dev. In fact, you don't even need to see the JSON to understand exactly what this code does (although it would certainly help ;).

First, we can extract the objects (portfolios and their children) from the JSON into a list, and use a series of steps to get them in the right form and order:

def prep_obj(o):
    """Prepare an object (portfolio/child) from the JSON for insertion into a dataframe."""
    return {
        'New Entity Group': o['name'],
    } | o['columns']

# Get a list of lists, where each sub-list contains the portfolio object at
# index 0, followed by that portfolio object's children:
groups = [
    [prep_obj(o), *[prep_obj(child) for child in o['children']]]
    for o in api_response['data']['attributes']['total']['children']
]

# Sort the portfolio groups by their number:
groups.sort(key=lambda g: int(g[0]['New Entity Group'].split('_')[1]))

# Reverse the children of each portfolio group:
groups = [[g[0]] + g[1:][::-1] for g in groups]

# Flatten out the groups into one large list of objects:
objects = [obj for group in groups for obj in group]
# The above is exactly equivalent to the following:
# objects = []
# for group in groups:
#     for obj in group:
#         objects.append(obj)

Next, create the dataframe:

# Create a mapping for column names so that their display names can be used:
mapping = {col['key']: col['display_name'] for col in api_response['meta']['columns']}

# Create a dataframe from the list of objects:
df = pd.DataFrame(objects)

# Correct column names:
df = df.rename(mapping, axis=1)
# Reorder columns:
column_names = ["New Entity Group", "Entity ID", "Adjusted Value (1/31/2022, No Div, USD)", "Adjusted TWR (Current Quarter, No Div, USD)", "Adjusted TWR (YTD, No Div, USD)", "Annualized Adjusted TWR (Since Inception, No Div, USD)", "Inception Date", "Risk Target"]
df = df[column_names]

And formatting:

def format_twr_col(col):
    return (
        col
        .abs()
        .mul(100)
        .round(2)
        .pipe(lambda s: s.where(s.eq(0) | s.isna(), '(' + s.astype(str) + '%)'))
        .pipe(lambda s: s.where(s.ne(0) | s.isna(), s.astype(str) + '%'))
        .fillna('-')
    )

def format_value_col(col):
    positive_mask = col.ge(0)

    col[positive_mask] = (
        col[positive_mask]
        .round()
        .astype(int)
        .map('${:,}'.format)
    )

    col[~positive_mask] = (
        col[~positive_mask]
        .astype(float)
        .round()
        .astype(int)
        .abs()
        .map('(${:,})'.format)
    )

    return col

df['Adjusted TWR (Current Quarter, No Div, USD)'] = format_twr_col(df['Adjusted TWR (Current Quarter, No Div, USD)'])
df['Annualized Adjusted TWR (Since Inception, No Div, USD)'] = format_twr_col(df['Annualized Adjusted TWR (Since Inception, No Div, USD)'])
df['Adjusted TWR (YTD, No Div, USD)'] = format_twr_col(df['Adjusted TWR (YTD, No Div, USD)'])

df['Adjusted Value (1/31/2022, No Div, USD)'] = format_value_col(df['Adjusted Value (1/31/2022, No Div, USD)'].copy())

df['Inception Date'] = pd.to_datetime(df['Inception Date']).dt.strftime('%b %d, %Y')

df['Entity ID'] = df['Entity ID'].fillna('')

And... voilà:

>>> pd.options.display.max_columns = None
>>> df
New Entity Group Entity ID Adjusted Value (1/31/2022, No Div, USD) Adjusted TWR (Current Quarter, No Div, USD) Adjusted TWR (YTD, No Div, USD) Annualized Adjusted TWR (Since Inception, No Div, USD) Inception Date Risk Target
0 Portfolio_1 $260,786 (44.55%) (44.55%) (44.55%) Apr 07, 2021 N/A
1 The FW Irrev Family Tr 9552252 $260,786 0.0% 0.0% 0.0% Jan 11, 2022 N/A
2 Portfolio_2 $18,396,664 (5.78%) (5.78%) (5.47%) Sep 03, 2021 Growth
3 FW DAF 10946585 $18,396,664 (5.78%) (5.78%) (5.47%) Sep 03, 2021 Growth
4 Portfolio_3 $60,143,818 (4.42%) (4.42%) (7.75%) Dec 17, 2020 NaN
5 The FW Family Trust 13014080 $475,356 (6.1%) (6.1%) (3.97%) Apr 09, 2021 Aggressive
6 FW Liquid Fund LP 13396796 $52,899,527 (4.15%) (4.15%) (4.15%) Dec 30, 2021 Aggressive
7 FW Holdings No. 2 LLC 8413655 $6,768,937 (0.77%) (0.77%) (11.84%) Mar 05, 2021 N/A
8 FW and FR Joint 9957007 ($1) - - - Dec 21, 2021 N/A

Parsing nested json into pandas DataFrame

You haven't specified exactly what output you are looking for. Here is how I've done it:

import json
import pandas as pd

with open('tournament_7.json') as data_file:
    data = json.load(data_file)

df = pd.DataFrame(columns=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])

for tour in data["games"]:
    df = pd.concat(
        [df, pd.json_normalize(data, record_path=['games', tour],
                               meta=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])],
        ignore_index=True, axis=0)

print(df)

So I just loop through the different tours present in the games dictionary and concat the resulting DataFrames. You may want to add a column that specifies which tour each row belongs to, but that is up to you.
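Tagging each chunk with its tour can be done with assign before concatenating. A minimal sketch on made-up data (tournament_7.json is not available here, so the structure is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the loaded JSON.
data = {"games": {"tour_1": [{"home": "A", "away": "B"}],
                  "tour_2": [{"home": "C", "away": "D"}]},
        "name": "tournament_7"}

parts = []
for tour in data["games"]:
    part = pd.json_normalize(data, record_path=["games", tour], meta=["name"])
    # Record which tour this chunk came from.
    parts.append(part.assign(tour=tour))

tagged = pd.concat(parts, ignore_index=True)
print(tagged["tour"].tolist())  # ['tour_1', 'tour_2']
```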

Convert nested JSON to pandas DataFrame

To unpack the dictionary, use json_normalize with a record_path=... argument.

import json
import pandas as pd

data = json.loads(result)
pd.json_normalize(data, record_path='data')

date marketCap
0 2018-01-12 232547809668.32000000
1 2018-01-13 241311607656.32000000

If you want the other values as well, pass a meta=... argument:

df = pd.json_normalize(data,
                       record_path='data',
                       meta=['coin', 'dataType', 'baseCurrency'])
df

date marketCap ... dataType baseCurrency
0 2018-01-12 232547809668.32000000 ... marketCap USD
1 2018-01-13 241311607656.32000000 ... marketCap USD

df.columns
# Index(['date', 'marketCap', 'coin', 'dataType', 'baseCurrency'], dtype='object')

how to read nested json file in pandas dataframe?

import os
import glob
import json

import pandas as pd

path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'
json_paths = glob.glob(os.path.join(path_to_json, "*.json"))
df = pd.concat((pd.json_normalize(json.load(open(p))) for p in json_paths), axis=0)
df = df.reset_index(drop=True)  # Optionally reset index.

This will load all your json files into a single dataframe.
It will also flatten the nested json hierarchy by adding '.' between the keys.

You will probably need to perform further data cleaning, e.g., by replacing the NaNs with appropriate values. This can be done with the dataframe's fillna, or by applying a function to transform individual values.
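The '.'-separated flattening mentioned above is easy to see in isolation:

```python
import pandas as pd

# Minimal illustration of how json_normalize flattens nested keys with '.'.
record = {"title": "t", "meta": {"source": {"url": "http://example.com"}}}
flat = pd.json_normalize(record)
print(list(flat.columns))  # ['title', 'meta.source.url']
```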

Edit

As I mentioned in the comment, the data is actually messy, so words such as "View All Post" can be one of the values for "authors". See the JSON "BuzzFeed_Fake_26-Webpage.json" for an example.

To remove these entries and possibly others,

import functools

# This will be a set of entries you wish to remove.
# Here we only consider "View All Posts".
invalid_entries = {"View All Posts"}

def fix(x, invalid):
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    else:
        # You can optionally choose to return [] here to fix the NaNs
        # and to standardize the types of the values in this column.
        return x

fix_author = functools.partial(fix, invalid=invalid_entries)
df["authors"] = df.authors.apply(fix_author)

