How to Normalize JSON Correctly by Python Pandas

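You can pass data directly, without any extra parameters. In pandas 1.0 and later, use the top-level pd.json_normalize; pd.io.json.json_normalize is deprecated.

import pandas as pd

df = pd.json_normalize(data)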
df

complete mid.c mid.h mid.l mid.o time volume
0 True 119.743 119.891 119.249 119.341 1488319200.000000000 14651
1 True 119.893 119.954 119.552 119.738 1488348000.000000000 10738
2 True 119.946 120.221 119.840 119.888 1488376800.000000000 10041
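For readers without the original payload, here is a minimal sketch of input that would flatten this way. The records are an assumption, reconstructed from the first two rows of the output above; the real response may carry other fields or types.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
data = [
    {"complete": True, "volume": 14651, "time": "1488319200.000000000",
     "mid": {"o": "119.341", "h": "119.891", "l": "119.249", "c": "119.743"}},
    {"complete": True, "volume": 10738, "time": "1488348000.000000000",
     "mid": {"o": "119.738", "h": "119.954", "l": "119.552", "c": "119.893"}},
]

# Nested dict keys are joined to their parent key with "." by default.
df = pd.json_normalize(data)
print(sorted(df.columns))
```

The nested "mid" dict becomes the columns mid.o, mid.h, mid.l and mid.c; top-level keys keep their names.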

If you want to change the column order, use df.reindex:

df = df.reindex(columns=['time', 'volume', 'complete', 'mid.h', 'mid.l', 'mid.c', 'mid.o'])
df

time volume complete mid.h mid.l mid.c mid.o
0 1488319200.000000000 14651 True 119.891 119.249 119.743 119.341
1 1488348000.000000000 10738 True 119.954 119.552 119.893 119.738
2 1488376800.000000000 10041 True 120.221 119.840 119.946 119.888
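One thing to keep in mind with reindex: a column name that does not exist is not an error, it just produces a column of NaN. Plain bracket selection is stricter. A small sketch with a made-up frame (the "spread" column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"volume": [14651], "time": ["1488319200.000000000"], "complete": [True]})

# reindex reorders existing columns...
reordered = df.reindex(columns=["time", "volume", "complete"])

# ...and silently creates an all-NaN column for unknown names.
with_missing = df.reindex(columns=["time", "volume", "spread"])

# Bracket selection raises KeyError for unknown names instead.
try:
    df[["time", "spread"]]
    raised = False
except KeyError:
    raised = True

print(list(reordered.columns), with_missing["spread"].isna().all(), raised)
```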

Pandas: JSON Normalize with brackets around the JSON?

If your JSON objects are in the xd column, you can extract them. Each cell holds a one-element list, so taking the first element of every cell gives a list of dictionaries, and a list of dictionaries can be passed straight to the DataFrame constructor.

list_of_dicts = [cell[0] for cell in df['xd'].to_list()]
expected = pd.DataFrame(list_of_dicts)
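A self-contained sketch of the same idea, assuming each cell of xd is a list wrapping exactly one dict (the sample values are invented):

```python
import pandas as pd

# Hypothetical frame: every "xd" cell is a one-element list holding a dict.
df = pd.DataFrame({"xd": [[{"a": 1, "b": 2}], [{"a": 3, "b": 4}]]})

# Unwrap the brackets, then let the constructor turn dict keys into columns.
list_of_dicts = [cell[0] for cell in df["xd"]]
expected = pd.DataFrame(list_of_dicts)
print(expected)
```

If some cells can hold more than one dict, df['xd'].explode() would be the safer first step.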

Does this answer your question?

How to read and normalize following json in pandas?

Here is another way:

df = pd.read_json(r'C:\path\file.json')

final=df.stack().str[0].unstack()
final=final.assign(cities=final['cities'].str.split(',')).explode('cities')
final=final.assign(**pd.DataFrame(final.pop('user').str[0].tolist()))
print(final)

      session_id unix_timestamp            cities  user_id joining_date  \
0 X061RFWB06K9V 1442503708 New York NY 2024 2015-03-22
0 X061RFWB06K9V 1442503708 Newark NJ 2024 2015-03-22
1 5AZ2X2A9BHH5U 1441353991 New York NY 2024 2015-03-22
1 5AZ2X2A9BHH5U 1441353991 Jersey City NJ 2024 2015-03-22
1 5AZ2X2A9BHH5U 1441353991 Philadelphia PA 2024 2015-03-22

country
0 UK
0 UK
1 UK
1 UK
1 UK
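If the chained stack/unstack calls are hard to follow, the same unwrapping can be written as an explicit loop. This is only a sketch: the payload shape below is an assumption inferred from the calls and output above (every value wrapped in a one-element list, "user" wrapped twice, cities as one comma-separated string).

```python
import pandas as pd

# Hypothetical payload with the assumed shape.
user = {"user_id": 2024, "joining_date": "2015-03-22", "country": "UK"}
raw = {
    "session_id": [["X061RFWB06K9V"], ["5AZ2X2A9BHH5U"]],
    "unix_timestamp": [[1442503708], [1441353991]],
    "cities": [["New York NY, Newark NJ"],
               ["New York NY, Jersey City NJ, Philadelphia PA"]],
    "user": [[[dict(user)]], [[dict(user)]]],
}

rows = []
for i in range(len(raw["session_id"])):
    rows.append({
        "session_id": raw["session_id"][i][0],          # unwrap one level
        "unix_timestamp": raw["unix_timestamp"][i][0],
        # split the city string into a real list so it can be exploded
        "cities": [city.strip() for city in raw["cities"][i][0].split(",")],
        **raw["user"][i][0][0],                         # unwrap two levels, merge keys
    })

# One output row per city; the other columns are repeated.
final = pd.DataFrame(rows).explode("cities")
print(final)
```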

Normalizing nested JSON object into Pandas dataframe

Personally, I would not use pd.json_normalize for this case. Your JSON is quite complex, and unless you're really experienced with json_normalize, the following code may take less time to understand for the average dev. In fact, you don't even need to see the JSON to understand exactly what this code does (although it would certainly help ;).

First, we can extract the objects (portfolios and their children) from the JSON into a list, and use a series of steps to get them in the right form and order:

def prep_obj(o):
    """Prepares an object (portfolio/child) from the JSON to be inserted into a dataframe."""
    # The dict union operator `|` requires Python 3.9+.
    return {
        'New Entity Group': o['name'],
    } | o['columns']

# Get a list of lists, where each sub-list contains the portfolio object at index 0 and then the portfolio object's children:
groups = [[prep_obj(o), *[prep_obj(child) for child in o['children']]] for o in api_response['data']['attributes']['total']['children']]

# Sort the portfolio groups by their number:
groups.sort(key=lambda g: int(g[0]['New Entity Group'].split('_')[1]))

# Reverse the children of each portfolio group:
groups = [[g[0]] + g[1:][::-1] for g in groups]

# Flatten out the groups into one large list of objects:
objects = [obj for group in groups for obj in group]
# The above is exactly equivalent to the following:
# objects = []
# for group in groups:
#     for obj in group:
#         objects.append(obj)
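To see the four steps on something small, here is a sketch with an invented api_response. The names, the single "value" column, and the nesting depth are assumptions; only the key path mirrors the code above. prep_obj is rewritten with ** unpacking so the sketch also runs on Python older than 3.9.

```python
# Hypothetical payload with two portfolios, deliberately out of order.
api_response = {"data": {"attributes": {"total": {"children": [
    {"name": "Portfolio_2", "columns": {"value": 20}, "children": [
        {"name": "B1", "columns": {"value": 5}, "children": []},
        {"name": "B2", "columns": {"value": 15}, "children": []},
    ]},
    {"name": "Portfolio_1", "columns": {"value": 10}, "children": [
        {"name": "A1", "columns": {"value": 10}, "children": []},
    ]},
]}}}}

def prep_obj(o):
    # Same result as `{...} | o['columns']`, without needing Python 3.9.
    return {"New Entity Group": o["name"], **o["columns"]}

portfolios = api_response["data"]["attributes"]["total"]["children"]
groups = [[prep_obj(o), *[prep_obj(c) for c in o["children"]]] for o in portfolios]
groups.sort(key=lambda g: int(g[0]["New Entity Group"].split("_")[1]))
groups = [[g[0]] + g[1:][::-1] for g in groups]
objects = [obj for group in groups for obj in group]

names = [o["New Entity Group"] for o in objects]
print(names)  # ['Portfolio_1', 'A1', 'Portfolio_2', 'B2', 'B1']
```

Each portfolio is followed by its children in reversed order, and portfolios are sorted by the number after the underscore.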

Next, create the dataframe:

# Create a mapping for column names so that their display names can be used:
mapping = {col['key']: col['display_name'] for col in api_response['meta']['columns']}

# Create a dataframe from the list of objects:
df = pd.DataFrame(objects)

# Correct column names:
df = df.rename(mapping, axis=1)
# Reorder columns:
column_names = ["New Entity Group", "Entity ID", "Adjusted Value (1/31/2022, No Div, USD)", "Adjusted TWR (Current Quarter, No Div, USD)", "Adjusted TWR (YTD, No Div, USD)", "Annualized Adjusted TWR (Since Inception, No Div, USD)", "Inception Date", "Risk Target"]
df = df[column_names]

And formatting:

def format_twr_col(col):
    return (
        col
        .abs()
        .mul(100)
        .round(2)
        .pipe(lambda s: s.where(s.eq(0) | s.isna(), '(' + s.astype(str) + '%)'))
        .pipe(lambda s: s.where(s.ne(0) | s.isna(), s.astype(str) + '%'))
        .fillna('-')
    )

def format_value_col(col):
    positive_mask = col.ge(0)

    col[positive_mask] = (
        col[positive_mask]
        .round()
        .astype(int)
        .map('${:,}'.format)
    )

    col[~positive_mask] = (
        col[~positive_mask]
        .astype(float)
        .round()
        .astype(int)
        .abs()
        .map('(${:,})'.format)
    )

    return col

df['Adjusted TWR (Current Quarter, No Div, USD)'] = format_twr_col(df['Adjusted TWR (Current Quarter, No Div, USD)'])
df['Annualized Adjusted TWR (Since Inception, No Div, USD)'] = format_twr_col(df['Annualized Adjusted TWR (Since Inception, No Div, USD)'])
df['Adjusted TWR (YTD, No Div, USD)'] = format_twr_col(df['Adjusted TWR (YTD, No Div, USD)'])

df['Adjusted Value (1/31/2022, No Div, USD)'] = format_value_col(df['Adjusted Value (1/31/2022, No Div, USD)'].copy())

df['Inception Date'] = pd.to_datetime(df['Inception Date']).dt.strftime('%b %d, %Y')

df['Entity ID'] = df['Entity ID'].fillna('')
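The helpers above write strings into numeric columns, which newer pandas versions may warn about or reject. A per-value sketch of the same display rules, applied with Series.map, avoids that. The function names format_twr and format_value are hypothetical, and unlike format_value_col this version renders missing values as '-' instead of failing on them.

```python
import pandas as pd

def format_twr(value):
    """Fraction -> percent string: '-' if missing, 'x%' if zero, '(x%)' otherwise."""
    if pd.isna(value):
        return "-"
    pct = round(abs(float(value)) * 100, 2)
    return f"{pct}%" if pct == 0 else f"({pct}%)"

def format_value(value):
    """Number -> '$1,234', with negatives shown as '($1,234)'."""
    if pd.isna(value):
        return "-"
    amount = int(round(float(value)))
    return f"${amount:,}" if value >= 0 else f"(${abs(amount):,})"

twr = pd.Series([-0.4455, 0.0, float("nan")]).map(format_twr)
values = pd.Series([260786.2, -1.0]).map(format_value)
print(twr.tolist(), values.tolist())
```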

And... voilà:

>>> pd.options.display.max_columns = None
>>> df
New Entity Group Entity ID Adjusted Value (1/31/2022, No Div, USD) Adjusted TWR (Current Quarter, No Div, USD) Adjusted TWR (YTD, No Div, USD) Annualized Adjusted TWR (Since Inception, No Div, USD) Inception Date Risk Target
0 Portfolio_1 $260,786 (44.55%) (44.55%) (44.55%) Apr 07, 2021 N/A
1 The FW Irrev Family Tr 9552252 $260,786 0.0% 0.0% 0.0% Jan 11, 2022 N/A
2 Portfolio_2 $18,396,664 (5.78%) (5.78%) (5.47%) Sep 03, 2021 Growth
3 FW DAF 10946585 $18,396,664 (5.78%) (5.78%) (5.47%) Sep 03, 2021 Growth
4 Portfolio_3 $60,143,818 (4.42%) (4.42%) (7.75%) Dec 17, 2020 NaN
5 The FW Family Trust 13014080 $475,356 (6.1%) (6.1%) (3.97%) Apr 09, 2021 Aggressive
6 FW Liquid Fund LP 13396796 $52,899,527 (4.15%) (4.15%) (4.15%) Dec 30, 2021 Aggressive
7 FW Holdings No. 2 LLC 8413655 $6,768,937 (0.77%) (0.77%) (11.84%) Mar 05, 2021 N/A
8 FW and FR Joint 9957007 ($1) - - - Dec 21, 2021 N/A

How to normalize a nested json with json_normalize

  • Use pandas.json_normalize()
  • The following code uses pandas v.1.2.4
  • If you don't want the other columns, remove the list of keys assigned to meta
  • Use pandas.DataFrame.drop to remove any other unwanted columns from df.
import pandas as pd

df = pd.json_normalize(data, record_path=['results', 'docs'], meta=[['results', 'name'], 'numberOfResults'])

display(df)
id type category media label title subtitle results.name numberOfResults
0 RAKDI342342 Culture Culture unknown exampellabel testtitle and titletest Archive single 376
1 GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER Culture Culture image more label als example test the second title picture single 376

Data

  • The posted JSON / Dict is not correctly formed
  • Assuming the following corrected form
data = \
{'numberOfResults': 376,
'results': [{'docs': [{'category': 'Culture',
'id': 'RAKDI342342',
'label': 'exampellabel',
'media': 'unknown',
'subtitle': 'Archive',
'title': 'testtitle and titletest',
'type': 'Culture'},
{'category': 'Culture',
'id': 'GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER',
'label': 'more label als example',
'media': 'image',
'subtitle': 'picture',
'title': 'test the second title',
'type': 'Culture'}],
'name': 'single'}]}
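To make record_path and meta less abstract, here is a sketch that builds the same rows by hand and compares them with json_normalize. It uses a trimmed copy of the corrected data (only id and type are kept per doc).

```python
import pandas as pd

data = {
    "numberOfResults": 376,
    "results": [{
        "name": "single",
        "docs": [
            {"id": "RAKDI342342", "type": "Culture"},
            {"id": "GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER", "type": "Culture"},
        ],
    }],
}

normalized = pd.json_normalize(
    data,
    record_path=["results", "docs"],               # one output row per doc
    meta=[["results", "name"], "numberOfResults"],  # values copied onto every row
)

# The same thing as nested loops: walk the record path, attach the meta values.
rows = [
    {**doc, "results.name": result["name"], "numberOfResults": data["numberOfResults"]}
    for result in data["results"]
    for doc in result["docs"]
]
manual = pd.DataFrame(rows)

print(normalized)
```

record_path picks the list that becomes rows; each meta entry is a key (or path of keys) whose value is repeated alongside them.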

Pandas JSON Normalize - Choose Correct Record Path

You can try to apply the following function to your json:

def flatten_nested_json_df(df):
    # Note: DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map.
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    # Keep going until no column holds only lists or only dicts.
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            # Expand a dict column sideways into one column per key.
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            #print(f"exploding: {col}")
            # Expand a list column downwards into one row per element.
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

Call it on the result of a first json_normalize pass:

df = pd.json_normalize(json)
df1 = flatten_nested_json_df(df)

That should give you all the information contained in your json.
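For intuition, here is a single round of what the function does, written out by hand for one column that holds lists of dicts. The records are invented, and this sketch handles exactly one level of nesting rather than looping until everything is flat.

```python
import pandas as pd

# Hypothetical records: "items" is a list of dicts, which json_normalize leaves alone.
records = [
    {"order": 1, "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}]},
    {"order": 2, "items": [{"sku": "c", "qty": 5}]},
]

df = pd.json_normalize(records)

# Step 1: explode the list column, one row per list element (each now a dict).
exploded = df.explode("items").reset_index(drop=True)

# Step 2: expand the dicts sideways and prefix the new columns.
item_cols = pd.json_normalize(exploded["items"].tolist()).add_prefix("items.")

flat = pd.concat([exploded.drop(columns="items"), item_cols], axis=1)
print(flat)
```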

How do I unpack multiple levels using json_normalize in python pandas?

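Fix your dictionary first. It is inconsistent: Movies is sometimes a single dict and sometimes a list of dicts. This loop wraps the single dicts in a list so that every record has the same shape: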

for i, x in enumerate(data):
    x = x['Source'][0]['Movies']
    if not isinstance(x, list):
        data[i]['Source'][0]['Movies'] = [x]

Then json_normalize works just fine:

df = pd.json_normalize(data, ['Source','Movies'], ['Name', 'Year', 'Location'])
print(df)

Output:

   MovieNumber  Money  Percent   Name  Year Location
0 1 1000 10 Rocco 2020 Itay
1 1 2000 10 Anja 2021 Germany
2 2 3000 10 Anja 2021 Germany
3 1 1000 10 Kasia 2021 Poland
4 2 1000 10 Kasia 2021 Poland
5 3 1000 10 Kasia 2021 Poland

Here is what the loop changes. Before:

[
{
"Name": "Rocco",
"Year": 2020,
"Location": "Itay",
"Source": [
{
"Movies": # Here, Movies isn't a list.
{"MovieNumber": 1, "Money": 1000, "Percent": 10}
}
]
},
{
"Name": "Anja",
"Year": 2021,
"Location": "Germany",
"Source": [
{
"Movies": [ # Here, Movies is a list.
{"MovieNumber": 1, "Money": 2000, "Percent": 10},
{"MovieNumber": 2, "Money": 3000, "Percent": 10}
]
}
]
}
]

After:

[
{
"Name": "Rocco",
"Year": 2020,
"Location": "Itay",
"Source": [
{
"Movies": [ # Now this is a list.
{"MovieNumber": 1, "Money": 1000, "Percent": 10}
]
}
]
},
{
"Name": "Anja",
"Year": 2021,
"Location": "Germany",
"Source": [
{
"Movies": [ # And this remains unchanged.
{"MovieNumber": 1, "Money": 2000, "Percent": 10},
{"MovieNumber": 2, "Money": 3000, "Percent": 10 }
]
}
]
}
]

So all I did was force all Source.Movies to be lists, by putting the contents in a list if it wasn't already a list.
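The same fix as a small reusable helper, run end to end on the two records shown above. ensure_list is a hypothetical name, and unlike the original loop this sketch visits every entry in Source rather than only the first.

```python
import pandas as pd

def ensure_list(value):
    """Wrap a lone value in a list; leave existing lists untouched."""
    return value if isinstance(value, list) else [value]

data = [
    {"Name": "Rocco", "Year": 2020, "Location": "Itay",
     "Source": [{"Movies": {"MovieNumber": 1, "Money": 1000, "Percent": 10}}]},
    {"Name": "Anja", "Year": 2021, "Location": "Germany",
     "Source": [{"Movies": [
         {"MovieNumber": 1, "Money": 2000, "Percent": 10},
         {"MovieNumber": 2, "Money": 3000, "Percent": 10},
     ]}]},
]

# Normalize the shape in place before handing the data to pandas.
for record in data:
    for source in record["Source"]:
        source["Movies"] = ensure_list(source["Movies"])

df = pd.json_normalize(data, ["Source", "Movies"], ["Name", "Year", "Location"])
print(df)
```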

Python pandas normalize this Json into pandas

Use:

gateio = pd.json_normalize(e)
gateio.columns = gateio.columns.str.split('.', expand=True)
df = gateio.rename_axis(('symbol', None), axis=1).stack(0).droplevel(0).reset_index()


print(df)
symbol baseVolume high24hr highestBid \
0 100x_usdt 0 0
1 10set_eth 0 0
2 10set_usdt 78055.955772115 2.334 2.3189
3 1art_usdt 84629.671759612 0.020476 0.020051
4 1earth_eth 0 0
... ... ... ...
3023 zrx_usd 378.6665316 0.3075 0.3036
3024 zrx_usdt 21064.601829316 0.3074 0.3038
3025 zsc_eth 6.5764445243 0.00000006666 0.00000005859
3026 zsc_usdt 12105.551030017 0.000099271 0.00009592
3027 ztg_usdt 17735.456307939 0.10993 0.0993

last low24hr lowestAsk percentChange \
0 0.00000001677 0 0
1 0 0 0
2 2.3258 2.25 2.3315 0.54
3 0.020139 0.019922 0.020318 -0.62
4 0 0 0
... ... ... ...
3023 0.3053 0.2919 0.3048 4.05
3024 0.3046 0.2923 0.3043 4.35
3025 0.00000006116 0.00000005942 0.00000006438 -7.91
3026 0.000098951 0.000095918 0.000101036 2.53
3027 0.09977 0.09929 0.1003 -7.96

quoteVolume result
0 0 true
1 0 true
2 34176.76678812 true
3 4186530.9550705 true
4 0 true
... ...
3023 1250.925 true
3024 69748.810196325 true
3025 105661371 true
3026 125394404.8585 true
3027 169037.51711601 true

[3028 rows x 10 columns]

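Another idea is to build the DataFrame with the constructor and then pivot:

import pandas as pd
import requests

gateio = requests.get("https://data.gateapi.io/api2/1/tickers")
e = gateio.json()
# Newer pandas releases accept only keyword arguments in pivot.
df = pd.DataFrame([(k, k1, v1) for k, v in e.items() for k1, v1 in v.items()]).pivot(index=0, columns=1, values=2)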
print(df)
1 baseVolume high24hr highestBid last \
0
100x_usdt 0 0 0.00000001677
10set_eth 0 0 0
10set_usdt 77135.369425029 2.334 2.3189 2.324
1art_usdt 85135.129113461 0.020476 0.020073 0.020231
1earth_eth 0 0 0
... ... ... ...
zrx_usd 378.7539874 0.3075 0.3031 0.3036
zrx_usdt 20969.605384316 0.3074 0.3034 0.3048
zsc_eth 6.54257544205 0.00000006666 0.00000005891 0.00000006175
zsc_usdt 12071.777701317 0.000099271 0.00009592 0.00009804
ztg_usdt 17614.164813459 0.10918 0.0993 0.0998

1 low24hr lowestAsk percentChange quoteVolume result
0
100x_usdt 0 0 0 true
10set_eth 0 0 0 true
10set_usdt 2.25 2.3303 0.31 33779.242174485 true
1art_usdt 0.019922 0.02037 0.32 4211596.8280705 true
1earth_eth 0 0 0 true
... ... ... ... ...
zrx_usd 0.2919 0.3046 3.47 1251.201 true
zrx_usdt 0.2923 0.3041 4.27 69423.160196325 true
zsc_eth 0.00000005942 0.00000006479 -7.18 105182158 true
zsc_usdt 0.000095918 0.000100982 1.6 125041663.4785 true
ztg_usdt 0.09929 0.1002 -8.8 167942.13011601 true

[3028 rows x 9 columns]
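A third option, not part of the answer above: when the payload is a dict of dicts keyed by symbol, DataFrame.from_dict with orient="index" may be the shortest route. The two-symbol excerpt below is an assumption about the payload shape (string values, one dict per symbol); the numbers are copied from the output shown earlier.

```python
import pandas as pd

# Hypothetical excerpt shaped like the ticker payload.
e = {
    "zrx_usdt": {"last": "0.3046", "high24hr": "0.3074", "result": "true"},
    "ztg_usdt": {"last": "0.09977", "high24hr": "0.10993", "result": "true"},
}

# Outer keys become the index, inner keys become columns.
df = (
    pd.DataFrame.from_dict(e, orient="index")
    .rename_axis("symbol")
    .reset_index()
)
print(df)
```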

How to normalize a nested JSON key into a pandas dataframe

  • The 'results' key is a 1 element list, so 'members' can be normalized by selecting the 'members' key from the dict at index 0.
import pandas as pd
import requests

# Request data through the API
payload = {'X-API-Key': '...'}
terms = '"trade war"AND"China"'
index = str(0) # 440 is last offset for this call

response = requests.get('https://api.propublica.org/congress/v1/116/house/members.json', headers=payload)

# extract the json data from the response
json_data = response.json()

# normalize only members
members = pd.json_normalize(data=json_data['results'][0]['members'])

# alternatively: normalize members and the preceding keys
members = pd.json_normalize(data=json_data['results'][0], record_path=['members'], meta=['congress', 'chamber', 'num_results', 'offset'])

display(members)

        id           title short_title                                                      api_uri first_name middle_name  last_name suffix date_of_birth gender party leadership_role  twitter_account         facebook_account youtube_account govtrack_id cspan_id votesmart_id icpsr_id     crp_id google_entity_id fec_candidate_id                          url                                         rss_url contact_form  in_office cook_pvi  dw_nominate ideal_point seniority next_election  total_votes  missed_votes  total_present               last_updated                                  ocd_id                                office         phone   fax state  district  at_large geoid  missed_votes_pct  votes_with_party_pct  votes_against_party_pct
0 A000374 Representative Rep. https://api.propublica.org/congress/v1/members/A000374.json Ralph None Abraham None 1954-09-16 M R RepAbraham CongressmanRalphAbraham None 412630 76236 155414 21522 N00036633 /m/012dwd7_ H4LA05221 https://abraham.house.gov https://abraham.house.gov/rss.xml None False R+15 0.541 None 6 2020 954.0 377.0 0.0 2020-12-31 18:30:50 -0500 ocd-division/country:us/state:la/cd:5 417 Cannon House Office Building 202-225-8490 None LA 5 False 2205 39.52 94.93 4.90
1 A000370 Representative Rep. https://api.propublica.org/congress/v1/members/A000370.json Alma None Adams None 1946-05-27 F D None RepAdams CongresswomanAdams None 412607 76386 5935 21545 N00035451 /m/02b45d H4NC12100 https://adams.house.gov https://adams.house.gov/rss.xml None False D+18 -0.465 None 8 2020 954.0 26.0 0.0 2020-12-31 18:30:55 -0500 ocd-division/country:us/state:nc/cd:12 2436 Rayburn House Office Building 202-225-1510 None NC 12 False 3712 2.73 99.24 0.65
2 A000055 Representative Rep. https://api.propublica.org/congress/v1/members/A000055.json Robert B. Aderholt None 1965-07-22 M R None Robert_Aderholt RobertAderholt RobertAderholt 400004 45516 441 29701 N00003028 /m/024p03 H6AL04098 https://aderholt.house.gov https://aderholt.house.gov/rss.xml None False R+30 0.369 None 24 2020 954.0 71.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:al/cd:4 1203 Longworth House Office Building 202-225-4876 None AL 4 False 0104 7.44 93.60 6.29
3 A000371 Representative Rep. https://api.propublica.org/congress/v1/members/A000371.json Pete None Aguilar None 1979-06-19 M D None reppeteaguilar reppeteaguilar None 412615 79994 70114 21506 N00033997 /m/0jwv0xf H2CA31125 https://aguilar.house.gov https://aguilar.house.gov/rss.xml None False D+8 -0.291 None 6 2020 954.0 9.0 0.0 2020-12-31 18:30:52 -0500 ocd-division/country:us/state:ca/cd:31 109 Cannon House Office Building 202-225-3201 None CA 31 False 0631 0.94 97.45 2.44
4 A000372 Representative Rep. https://api.propublica.org/congress/v1/members/A000372.json Rick None Allen None 1951-11-07 M R None reprickallen CongressmanRickAllen None 412625 62545 136062 21516 N00033720 /m/0127y9dk H2GA12121 https://allen.house.gov None None False R+9 0.679 None 6 2020 954.0 15.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:ga/cd:12 2400 Rayburn House Office Building 202-225-2823 None GA 12 False 1312 1.57 92.26 7.63
5 A000376 Representative Rep. https://api.propublica.org/congress/v1/members/A000376.json Colin None Allred None 1983-04-15 M D None RepColinAllred None None 412828 None 177357 None N00040989 /m/03d066b H8TX32098 https://allred.house.gov None None False R+5 NaN None 2 2020 954.0 29.0 0.0 2020-12-31 18:30:52 -0500 ocd-division/country:us/state:tx/cd:32 328 Cannon House Office Building 202-225-2231 None TX 32 False 4832 3.04 97.72 2.17
6 A000367 Representative Rep. https://api.propublica.org/congress/v1/members/A000367.json Justin None Amash None 1980-04-18 M I justinamash repjustinamash repjustinamash 412438 1033767 105566 21143 N00031938 /m/0c00p_n https://amash.house.gov https://amash.house.gov/rss.xml None False R+6 NaN None 10 2020 524.0 0.0 10.0 2020-12-31 18:30:47 -0500 ocd-division/country:us/state:mi/cd:3 None None None MI 3 False 2603 0.00 58.49 41.51
7 A000367 Representative Rep. https://api.propublica.org/congress/v1/members/A000367.json Justin None Amash None 1980-04-18 M R justinamash repjustinamash repjustinamash 412438 1033767 105566 21143 N00031938 /m/0c00p_n H0MI03126 https://amash.house.gov https://amash.house.gov/rss.xml None False None 0.654 None 10 2020 430.0 0.0 5.0 2020-12-28 21:04:36 -0500 ocd-division/country:us/state:mi/cd:3 106 Cannon House Office Building 202-225-3831 None MI 3 False 2603 0.00 61.97 37.79
8 A000369 Representative Rep. https://api.propublica.org/congress/v1/members/A000369.json Mark None Amodei None 1958-06-12 M R None MarkAmodeiNV2 MarkAmodeiNV2 markamodeinv2 412500 62817 12537 21196 N00031177 /m/03bzdkn H2NV02395 https://amodei.house.gov https://amodei.house.gov/rss/news-releases.xml None False R+7 0.384 None 10 2020 954.0 36.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:nv/cd:2 104 Cannon House Office Building 202-225-6155 None NV 2 False 3202 3.77 92.63 7.26
9 A000377 Representative Rep. https://api.propublica.org/congress/v1/members/A000377.json Kelly None Armstrong None 1976-10-08 M R None RepArmstrongND None None 412794 None 139338 None N00042868 /g/11hcszksh3 H8ND00096 https://armstrong.house.gov None None False R+16 NaN None 2 2020 954.0 33.0 0.0 2020-12-31 18:30:49 -0500 ocd-division/country:us/state:nd/cd:1 1004 Longworth House Office Building 202-225-2611 None ND At-Large True 3800 3.46 93.31 6.58
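The two json_normalize calls can be compared offline with a stand-in for response.json(). The payload below is hypothetical and heavily trimmed; only its shape (a one-element results list containing members) follows the description above.

```python
import pandas as pd

# Hypothetical stand-in for response.json(); field values are illustrative.
json_data = {
    "status": "OK",
    "results": [{
        "congress": "116", "chamber": "House", "num_results": 2, "offset": 0,
        "members": [
            {"id": "A000374", "last_name": "Abraham", "state": "LA"},
            {"id": "A000370", "last_name": "Adams", "state": "NC"},
        ],
    }],
}

result = json_data["results"][0]

# Members only: one row per member, member fields as columns.
members_only = pd.json_normalize(result["members"])

# Members plus the sibling keys, repeated on every row.
with_meta = pd.json_normalize(
    result,
    record_path=["members"],
    meta=["congress", "chamber", "num_results", "offset"],
)

extra = sorted(c for c in with_meta.columns if c not in members_only.columns)
print(extra)  # ['chamber', 'congress', 'num_results', 'offset']
```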

