Pandas read nested json
You can use json_normalize:
import json
import pandas as pd

with open('myJson.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
                       record_prefix='locations_')
print(df)
  locations_arrTime locations_arrTimeDiffMin locations_depTime  \
0                                                        06:32
1             06:37                        1             06:40
2             08:24                        1

  locations_depTimeDiffMin           locations_name locations_platform  \
0                        0  Spital am Pyhrn Bahnhof                  2
1                        0  Windischgarsten Bahnhof                  2
2                                    Linz/Donau Hbf               1A-B

  locations_stationIdx locations_track  number    name        date
0                    0                  R 3932  R 3932  01.10.2016
1                    1                  R 3932  R 3932  01.10.2016
2                    2              22  R 3932  R 3932  01.10.2016
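For reference, a minimal myJson.json consistent with the output above might look like the following sketch (the field values are copied from the printed frame; other fields from the original file are omitted):

```python
import pandas as pd

# Sketch of the nested structure json_normalize expects here: top-level
# trip fields plus a list of per-stop "locations" dicts (values taken
# from the output above; other fields omitted).
data = {
    "date": "01.10.2016",
    "number": "R 3932",
    "name": "R 3932",
    "locations": [
        {"name": "Spital am Pyhrn Bahnhof", "depTime": "06:32", "platform": "2"},
        {"name": "Windischgarsten Bahnhof", "arrTime": "06:37", "depTime": "06:40", "platform": "2"},
        {"name": "Linz/Donau Hbf", "arrTime": "08:24", "platform": "1A-B"},
    ],
}

df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
                       record_prefix='locations_')
print(df)
```

Keys missing from individual location dicts simply come out as NaN in the corresponding columns.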
EDIT:
You can also use read_json, extract each location's name with the DataFrame constructor, and finally groupby with apply and ','.join:
df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date', 'name', 'number'])['locations'].apply(','.join).reset_index()
print(df)
         date    name  number                                          locations
0  2016-01-10  R 3932  R 3932  Spital am Pyhrn Bahnhof,Windischgarsten Bahnho...
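The groupby/join step in isolation, on a toy frame built from the values above:

```python
import pandas as pd

# Intermediate state mimicked by hand: one row per location, with the
# trip-level fields repeated on every row.
df = pd.DataFrame({
    "date": ["2016-01-10", "2016-01-10"],
    "name": ["R 3932", "R 3932"],
    "number": ["R 3932", "R 3932"],
    "locations": ["Spital am Pyhrn Bahnhof", "Windischgarsten Bahnhof"],
})

# Collapse the per-location rows into one comma-joined string per trip.
out = df.groupby(["date", "name", "number"])["locations"].apply(",".join).reset_index()
print(out)
```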
Read nested JSON into Pandas DataFrame
I suggest using pd.json_normalize() ( https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html ), which helps transform JSON data into a pandas DataFrame.
Note 1: In the following I assume the data is available in a Python dictionary called data. For testing purposes I used
import json

json_data = '''
{
    "meta": {
        # ....
    },
    #...
    "included": []
}
'''
data = json.loads(json_data)
where json_data is your JSON response. As json.loads() doesn't accept trailing commas, I omitted the comma after the children object.
pd.json_normalize() offers different options. One possibility is to simply read all "children" data and then drop the columns that are not required. Also, after normalizing, some column names carry a "columns." prefix which needs to be removed.
import pandas as pd

df = pd.json_normalize(data['data']['attributes']['total']['children'])
df.drop(columns=['grouping', 'entity_id'], inplace=True)
# regex=False so the dot is matched literally rather than as a wildcard:
df.columns = df.columns.str.replace('columns.', '', regex=False)
Finally, the columns names need to be replaced with those in the "columns" data:
column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
df.rename(columns=column_name_mapper, inplace=True)
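Putting the steps together on a skeletal response (the key names and values below are assumptions standing in for the real API payload):

```python
import pandas as pd

# Skeletal stand-in for the JSON response described above; all key
# names and values here are illustrative assumptions.
data = {
    "meta": {
        "columns": [
            {"key": "value", "display_name": "Adjusted Value"},
            {"key": "twr", "display_name": "Adjusted TWR"},
        ]
    },
    "data": {
        "attributes": {
            "total": {
                "children": [
                    {"name": "Portfolio_1", "grouping": "g1", "entity_id": 1,
                     "columns": {"value": 100.0, "twr": 0.05}},
                ]
            }
        }
    },
}

df = pd.json_normalize(data['data']['attributes']['total']['children'])
df.drop(columns=['grouping', 'entity_id'], inplace=True)
df.columns = df.columns.str.replace('columns.', '', regex=False)

column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
df.rename(columns=column_name_mapper, inplace=True)
print(df.columns.tolist())
```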
Note 2: There are some slight deviations from the expected structure you described. Most notably, the word 'name' (with the row value "Apple Holdings Adv (748374923)") in the data frame header is not changed to 'Holding Account', as neither term is found in the columns list. Some other values simply differ between the described JSON response and the expected structure.
How to load a nested json file into a pandas DataFrame
I think you can use json_normalize to load them into pandas. Here, path_to_json.json is your full JSON file (with double quotes).
import json
import pandas as pd

with open('path_to_json.json') as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path=['features'], meta=['name'])
print(df)
This results in a dataframe with one row per feature. You can add further fields to the record_path or meta arguments to create more columns for the polygon coordinates.
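As a self-contained sketch, a minimal GeoJSON-like dict (an assumption about the file's shape) normalized the same way:

```python
import pandas as pd

# Minimal GeoJSON-like structure (an assumption about the file's shape):
# a named collection with a list of features.
data = {
    "name": "my_layer",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0, 0]}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 1]}},
    ],
}

# One row per feature; nested geometry keys become dotted columns.
df = pd.json_normalize(data, record_path=['features'], meta=['name'])
print(df)
```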
You can find more documentation at https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html
Hope that helps.
Flattening deeply nested JSON into pandas data frame
I think the issue here is that your data is doubly nested: there is a key referenced_tweets within referenced_tweets.
import json
from pandas import json_normalize
with open("flatten.json", "r") as file:
data = json.load(file)
df = json_normalize(
data,
record_path=["referenced_tweets", "referenced_tweets"],
meta=[
"author_id",
# ["author", "username"], # not possible
# "author", # possible but not useful
["referenced_tweets", "id"],
["referenced_tweets", "type"],
["referenced_tweets", "in_reply_to_user_id"],
["referenced_tweets", "in_reply_to_user", "username"],
]
)
print(df)
See also: https://stackoverflow.com/a/37668569/42659
Note: the above code will fail if the second nested referenced_tweets is missing.
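A toy doubly nested payload (the field names follow the Twitter-style keys above; the values are invented) shows the double record_path at work:

```python
from pandas import json_normalize

# Invented sample with the double nesting described above: each entry in
# referenced_tweets itself carries a referenced_tweets list.
data = [
    {
        "author_id": "A1",
        "referenced_tweets": [
            {
                "id": "t1",
                "type": "retweeted",
                "referenced_tweets": [{"id": "t0", "type": "replied_to"}],
            }
        ],
    }
]

df = json_normalize(
    data,
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=["author_id", ["referenced_tweets", "id"], ["referenced_tweets", "type"]],
)
print(df)
```

The innermost tweets become the rows; the outer tweet's fields arrive as meta columns with dotted names such as referenced_tweets.id.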
Edit: Alternatively you could further normalize your data (which you already partly normalized with your code) in your question with an additional manual iteration. See example below. Note: Code is not optimized and may be slow depending on the amount of data.
import pandas as pd
from pandas import json_normalize

# load your `data` with `json.load()` or `json.loads()`
df = json_normalize(
    data,
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix=".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append(row, tweet_type, tweet_id):
    normalized_row = [*row.to_list(), tweet_type, tweet_id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    # a non-empty list of referenced tweets is expected
    if isinstance(row["referenced_tweets"], list) and row["referenced_tweets"]:
        for tweet in row["referenced_tweets"]:
            append(row, tweet["type"], tweet["id"])
    else:
        # empty list, or NaN when the key was missing
        append(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df = enhanced_df.drop(columns="referenced_tweets")
print(enhanced_df)
Edit 2: referenced_tweets should be an array. However, if there is no referenced tweet, the Twitter API seems to omit referenced_tweets completely. In that case, the cell value is NaN (a float) instead of an empty list. I updated the code above to take that into account.
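The NaN-instead-of-list case is easy to reproduce with a two-record toy input:

```python
import math

import pandas as pd

# When a key is absent from some records, json_normalize leaves a float
# NaN in that column rather than an empty list.
data = [
    {"author_id": "A1", "referenced_tweets": [{"id": "t1", "type": "quoted"}]},
    {"author_id": "A2"},  # no referenced_tweets key at all
]

df = pd.json_normalize(data)
print(df["referenced_tweets"])
```

This is why the loop above checks isinstance(..., list) rather than assuming a list is always present.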
Normalizing nested JSON object into Pandas dataframe
Personally, I would not use pd.json_normalize for this case. Your JSON is quite complex, and unless you're really experienced with json_normalize, the following code may take less time to understand for the average dev. In fact, you don't even need to see the JSON to understand exactly what this code does (although it would certainly help ;).
First, we can extract the objects (portfolios and their children) from the JSON into a list, and use a series of steps to get them in the right form and order:
def prep_obj(o):
    """Prepares an object (portfolio/child) from the JSON to be inserted into a dataframe."""
    # The dict-merge operator (|) requires Python 3.9+.
    return {'New Entity Group': o['name']} | o['columns']

# Get a list of lists, where each sub-list contains the portfolio object at
# index 0, followed by that portfolio object's children:
groups = [
    [prep_obj(o), *[prep_obj(child) for child in o['children']]]
    for o in api_response['data']['attributes']['total']['children']
]
# Sort the portfolio groups by their number:
groups.sort(key=lambda g: int(g[0]['New Entity Group'].split('_')[1]))
# Reverse the children of each portfolio group:
groups = [[g[0]] + g[1:][::-1] for g in groups]
# Flatten out the groups into one large list of objects:
objects = [obj for group in groups for obj in group]
# The above is exactly equivalent to the following:
# objects = []
# for group in groups:
# for obj in group:
# objects.append(obj)
Next, create the dataframe:
# Create a mapping for column names so that their display names can be used:
mapping = {col['key']: col['display_name'] for col in api_response['meta']['columns']}
# Create a dataframe from the list of objects:
df = pd.DataFrame(objects)
# Correct column names:
df = df.rename(mapping, axis=1)
# Reorder columns:
column_names = ["New Entity Group", "Entity ID", "Adjusted Value (1/31/2022, No Div, USD)", "Adjusted TWR (Current Quarter, No Div, USD)", "Adjusted TWR (YTD, No Div, USD)", "Annualized Adjusted TWR (Since Inception, No Div, USD)", "Inception Date", "Risk Target"]
df = df[column_names]
And formatting:
def format_twr_col(col):
return (
col
.abs()
.mul(100)
.round(2)
.pipe(lambda s: s.where(s.eq(0) | s.isna(), '(' + s.astype(str) + '%)'))
.pipe(lambda s: s.where(s.ne(0) | s.isna(), s.astype(str) + '%'))
.fillna('-')
)
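A quick standalone check of what format_twr_col produces for a loss, a flat period, and a missing value (the function is repeated here so the snippet runs on its own):

```python
import pandas as pd

# Same helper as above, repeated for a self-contained run.
def format_twr_col(col):
    return (
        col
        .abs()
        .mul(100)
        .round(2)
        .pipe(lambda s: s.where(s.eq(0) | s.isna(), '(' + s.astype(str) + '%)'))
        .pipe(lambda s: s.where(s.ne(0) | s.isna(), s.astype(str) + '%'))
        .fillna('-')
    )

# A loss, a flat period, and a missing value.
twr = pd.Series([-0.4455, 0.0, None])
print(format_twr_col(twr).tolist())  # ['(44.55%)', '0.0%', '-']
```

Nonzero values are wrapped in parentheses with a percent sign, exact zeros get a plain '0.0%', and NaNs become '-'.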
def format_value_col(col):
positive_mask = col.ge(0)
col[positive_mask] = (
col[positive_mask]
.round()
.astype(int)
.map('${:,}'.format)
)
col[~positive_mask] = (
col[~positive_mask]
.astype(float)
.round()
.astype(int)
.abs()
.map('(${:,})'.format)
)
return col
df['Adjusted TWR (Current Quarter, No Div, USD)'] = format_twr_col(df['Adjusted TWR (Current Quarter, No Div, USD)'])
df['Annualized Adjusted TWR (Since Inception, No Div, USD)'] = format_twr_col(df['Annualized Adjusted TWR (Since Inception, No Div, USD)'])
df['Adjusted TWR (YTD, No Div, USD)'] = format_twr_col(df['Adjusted TWR (YTD, No Div, USD)'])
df['Adjusted Value (1/31/2022, No Div, USD)'] = format_value_col(df['Adjusted Value (1/31/2022, No Div, USD)'].copy())
df['Inception Date'] = pd.to_datetime(df['Inception Date']).dt.strftime('%b %d, %Y')
df['Entity ID'] = df['Entity ID'].fillna('')
And... voilà:
>>> pd.options.display.max_columns = None
>>> df
         New Entity Group Entity ID Adjusted Value (1/31/2022, No Div, USD)  \
0             Portfolio_1                                          $260,786
1  The FW Irrev Family Tr   9552252                                $260,786
2             Portfolio_2                                       $18,396,664
3                  FW DAF  10946585                             $18,396,664
4             Portfolio_3                                       $60,143,818
5     The FW Family Trust  13014080                                $475,356
6       FW Liquid Fund LP  13396796                             $52,899,527
7   FW Holdings No. 2 LLC   8413655                              $6,768,937
8         FW and FR Joint   9957007                                    ($1)

  Adjusted TWR (Current Quarter, No Div, USD) Adjusted TWR (YTD, No Div, USD)  \
0                                    (44.55%)                        (44.55%)
1                                        0.0%                            0.0%
2                                     (5.78%)                         (5.78%)
3                                     (5.78%)                         (5.78%)
4                                     (4.42%)                         (4.42%)
5                                      (6.1%)                          (6.1%)
6                                     (4.15%)                         (4.15%)
7                                     (0.77%)                         (0.77%)
8                                           -                               -

  Annualized Adjusted TWR (Since Inception, No Div, USD) Inception Date Risk Target
0                                               (44.55%)   Apr 07, 2021         N/A
1                                                   0.0%   Jan 11, 2022         N/A
2                                                (5.47%)   Sep 03, 2021      Growth
3                                                (5.47%)   Sep 03, 2021      Growth
4                                                (7.75%)   Dec 17, 2020         NaN
5                                                (3.97%)   Apr 09, 2021  Aggressive
6                                                (4.15%)   Dec 30, 2021  Aggressive
7                                               (11.84%)   Mar 05, 2021         N/A
8                                                      -   Dec 21, 2021         N/A
Parsing nested json into pandas DataFrame
You haven't specified exactly what output you are looking for. Here is how I've done it:
import json
import pandas as pd

with open('tournament_7.json') as data_file:
    data = json.load(data_file)

df = pd.DataFrame(columns=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])
for tour in data["games"]:
    df = pd.concat(
        [df, pd.json_normalize(data, record_path=['games', tour],
                               meta=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])],
        ignore_index=True, axis=0,
    )
print(df)
So I just loop through the different tours that are present in the games dictionary. Then I concat the resultant DataFrame. You may want to add a column that specifies which tour this row is for but that is up to you.
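With a toy tournament dict (the key names are taken from the code above; the game fields are invented), the loop behaves like this:

```python
import pandas as pd

# Miniature stand-in for tournament_7.json: "games" maps tour names to
# lists of game records (the game fields are invented).
data = {
    "name": "Tournament 7",
    "start_date": "2023-01-01",
    "end_date": "2023-01-07",
    "tours": 2,
    "type": "round-robin",
    "winner": "Alice",
    "games": {
        "tour_1": [{"white": "Alice", "black": "Bob", "result": "1-0"}],
        "tour_2": [{"white": "Bob", "black": "Alice", "result": "0-1"}],
    },
}

df = pd.DataFrame(columns=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])
for tour in data["games"]:
    df = pd.concat(
        [df, pd.json_normalize(data, record_path=['games', tour],
                               meta=['name', 'start_date', 'end_date', 'tours', 'type', 'winner'])],
        ignore_index=True, axis=0,
    )
print(df)
```

Each iteration flattens one tour's list of games and tags every row with the tournament-level meta fields.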
Convert nested JSON to pandas DataFrame
To unpack the dictionary, use json_normalize with a record_path=... argument.
import json
import pandas as pd

# pandas.io.json is deprecated; use the stdlib json module instead.
data = json.loads(result)
pd.json_normalize(data, record_path='data')
         date              marketCap
0  2018-01-12  232547809668.32000000
1  2018-01-13  241311607656.32000000
If you want the other values as well, pass a meta=... argument:
df = pd.json_normalize(data,
                       record_path='data',
                       meta=['coin', 'dataType', 'baseCurrency'])
df
         date              marketCap ...  dataType baseCurrency
0  2018-01-12  232547809668.32000000 ... marketCap          USD
1  2018-01-13  241311607656.32000000 ... marketCap          USD
df.columns
# Index(['date', 'marketCap', 'coin', 'dataType', 'baseCurrency'], dtype='object')
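End to end with the standard-library json module; the sample below mirrors the output shown above (the coin value is an invented placeholder, as it was elided in the printout):

```python
import json

import pandas as pd

# JSON string shaped like the `result` payload above; the coin value is
# an invented placeholder.
result = '''{
    "coin": "BTC",
    "dataType": "marketCap",
    "baseCurrency": "USD",
    "data": [
        {"date": "2018-01-12", "marketCap": "232547809668.32000000"},
        {"date": "2018-01-13", "marketCap": "241311607656.32000000"}
    ]
}'''

data = json.loads(result)
df = pd.json_normalize(data, record_path='data',
                       meta=['coin', 'dataType', 'baseCurrency'])
print(df)
```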
how to read nested json file in pandas dataframe?
import os
import glob
import json
import pandas as pd

path_to_json = 'FakeNewsNetData/BuzzFeed/FakeNewsContent/'
json_paths = glob.glob(os.path.join(path_to_json, "*.json"))
df = pd.concat((pd.json_normalize(json.load(open(p))) for p in json_paths), axis=0)
df = df.reset_index(drop=True)  # Optionally reset the index.
This will load all your JSON files into a single dataframe. It will also flatten the nested JSON hierarchy by adding '.' between the keys. You will probably need to perform further data cleaning, e.g. by replacing the NaNs with appropriate values. This can be done with the dataframe's fillna, or by applying a function to transform individual values.
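For instance, on a toy frame with a missing authors entry:

```python
import pandas as pd

# Toy frame with one missing "authors" value.
df = pd.DataFrame({"title": ["a", "b"], "authors": [["Alice"], None]})

# fillna replaces the missing value with a placeholder...
df["authors_filled"] = df["authors"].fillna("unknown")

# ...while apply can transform individual values.
df["n_authors"] = df["authors"].apply(lambda x: len(x) if isinstance(x, list) else 0)
print(df)
```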
Edit
As I mentioned in the comment, the data is actually messy, so words such as "View All Post" can be one of the values for "authors". See the JSON "BuzzFeed_Fake_26-Webpage.json" for an example.
To remove these entries and possibly others,
# This will be a set of entries you wish to remove.
# Here we only consider "View All Posts".
invalid_entries = {"View All Posts"}
import functools

def fix(x, invalid):
    if isinstance(x, list):
        return [i for i in x if i not in invalid]
    else:
        # You can optionally choose to return [] here to fix the NaNs
        # and to standardize the types of the values in this column
        return x

fix_author = functools.partial(fix, invalid=invalid_entries)
df["authors"] = df.authors.apply(fix_author)