Unnesting a List of Lists in a Data Frame Column

Unnesting a list of lists in a data frame column

Note: Ignore the original and Update 1; Update 2 is better with the current state of the tidyverse.


Original:

With purrr, which is nice for lists,

library(purrr)

df %>% dmap(unlist)
## # A tibble: 2 x 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 1 2

which is more or less equivalent to

as.data.frame(lapply(df, unlist))
## x y
## a 1 1
## b 1 2

Update 1:

dmap has been deprecated and moved to purrrlyr, the home of interesting but ill-fated functions that will now shout lots of deprecation warnings at you. You could translate the base R idiom to tidyverse:

df %>% map(unlist) %>% as_tibble()

which will work fine for this case, but not for more than one row (a problem all these approaches face). A more robust solution might be

library(tidyverse)

df %>% bind_rows(df) %>% # make larger sample data
mutate_if(is.list, simplify_all) %>% # flatten each list element internally
unnest() # expand
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 2
#> 3 1 1
#> 4 1 2

Update 2:

At some point since this was asked, tidyr::unnest() got updated such that it doesn't error anymore, so you can just do

df %>%
unnest(y) %>%
unnest(y)
#> # A tibble: 2 × 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 2

If you care about the names in the list, pull them out first and then unnest the names and the list at the same time:

df %>%
mutate(label = map(y, names)) %>%
unnest(c(y, label)) %>%
unnest(y)
#> # A tibble: 2 × 3
#> x y label
#> <dbl> <dbl> <chr>
#> 1 1 1 a
#> 2 1 2 b

I'll leave the previous answers for continuity, but this is simpler.

Unnest list of lists of data frames, containing NAs

I create a helper function to combine p and c:

foo <- function(x) {
a <- x[[1]]
b <- x[[2]]
if (nrow(b) == 0) b[1, ] <- NA
return(cbind(a, b))
}

Then I run the helper function on each element and bind the rows:

do.call(rbind, lapply(mylist, foo))

The result:

> do.call(rbind, lapply(mylist, foo))
id text from
1 01 one A
2 01 two B
3 02 three C
4 02 four D
5 02 five E
6 03 <NA> <NA>

P.S. The same result using the R base pipe:

lapply(mylist, foo) |> do.call(what = rbind)

R: How to unnest a nested list into data.frame?

We could also use as.data.frame and this gets the correct type

out <- map_dfr(l12, as.data.frame)
str(out)
#'data.frame': 2 obs. of 3 variables:
# $ SeriousDlqin2yrs.prediction : chr "0" "1"
# $ SeriousDlqin2yrs.prediction_probs.0: num 0.5 0.6
# $ SeriousDlqin2yrs.prediction_probs.1: num 0.5 0.4

Or in base R

do.call(rbind, lapply(l12, as.data.frame))
# SeriousDlqin2yrs.prediction SeriousDlqin2yrs.prediction_probs.0 SeriousDlqin2yrs.prediction_probs.1
#1 0 0.5 0.5
#2 1 0.6 0.4

unnest list of lists of different lengths to dataframe

Would this work for you ?

library(jsonlite)
library(tidyverse)
data = fromJSON("http://search.worldbank.org/api/v2/wds?format=json&fl=abstracts,admreg,alt_title,authr,available_in,bdmdt,chronical_docm_id,closedt,colti,count,credit_no,disclosure_date,disclosure_type,disclosure_type_date,disclstat,display_title,docdt,docm_id,docna,docty,dois,entityid,envcat,geo_reg,geo_reg,geo_reg_and_mdk,guid,historic_topic,id,isbn,ispublicdocs,issn,keywd,lang,listing_relative_url,lndinstr,loan_no,majdocty,majtheme,ml_abstract,ml_display_title,new_url,owner,pdfurl,prdln,projectid,projn,publishtoextweb_dt,repnb,repnme,seccl,sectr,src_cit,subsc,subtopic,teratopic,theme,topic,topicv3,totvolnb,trustfund,txturl,unregnbr,url_friendly_title,versiontyp,versiontyp_key,virt_coll,vol_title,volnb&str_docdt=1986-01-01&end_docdt=2000-12-31&rows=500&os=1&srt=docdt&order=desc")

df <-
data$documents %>%
head(-1) %>% # remove facet element
transpose %>% # transpose so each subelement is now a main element
as_tibble %>% # convert to table
purrr::modify(~replace(.x,lengths(.x)==0,list(NA))) %>% # replace empty elements by list(NA) so they have length 1 too
modify_if(~all(lengths(.x)==1),unlist) # unlist lists that contain only items of length 1

Only one list column remains:

names(df)[map_chr(df,class) == "list"]
# [1] "keywd"

As it contains items of length 1 or 2:

table(lengths(df$keywd))
# 1 2
# 224 276

Here's what the output looks like:

glimpse(df)

# Observations: 500
# Variables: 38
# $ url <chr> "http://documents.worldbank.org/curated/en/903231468764970044/Attacking-rural-poverty-strategy-and-public-actions", "...
# $ available_in <chr> "English", "English", "English", "English", "English", "English,French,Spanish,Portuguese", "Portuguese,Chinese,Engli...
# $ url_friendly_title <chr> "http://documents.worldbank.org/curated/en/903231468764970044/Attacking-rural-poverty-strategy-and-public-actions", "...
# $ new_url <chr> "2000/12/1000476/Attacking-rural-poverty-strategy-and-public-actions", "2000/12/1000501/State-policies-and-womens-aut...
# $ guid <chr> "903231468764970044", "429001468753367328", "985531468746683502", "890081468757236671", "922151468776107524", "324581...
# $ disclosure_date <chr> "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z...
# $ disclosure_type <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA...
# $ disclosure_type_date <chr> "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z...
# $ publishtoextweb_dt <chr> "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z", "2010-07-01T00:00:00Z...
# $ docm_id <chr> "090224b0828c737a", "090224b0828ac316", "090224b0828bd3f7", "090224b0828ac343", "090224b0828cf43d", "090224b0828cf42b...
# $ chronical_docm_id <chr> "090224b0828c737a", "090224b0828ac316", "090224b0828bd3f7", "090224b0828ac343", "090224b0828cf43d", "090224b0828cf42b...
# $ txturl <chr> "http://documents.worldbank.org/curated/en/903231468764970044/text/multi-page.txt", "http://documents.worldbank.org/c...
# $ pdfurl <chr> "http://documents.worldbank.org/curated/en/903231468764970044/pdf/multi-page.pdf", "http://documents.worldbank.org/cu...
# $ docdt <chr> "2000-12-31T00:00:00Z", "2000-12-31T00:00:00Z", "2000-12-31T00:00:00Z", "2000-12-31T00:00:00Z", "2000-12-31T00:00:00Z...
# $ totvolnb <chr> "1", "1", "1", "1", "5", "1", "1", "14", "1", "1", "1", "1", "14", "14", "14", "14", "14", "14", "14", "14", "14", "1...
# $ versiontyp <chr> "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", "Final", ...
# $ versiontyp_key <chr> "1309935", "1309935", "1309935", "1309935", "1309935", "1309935", "1309935", "1309935", "1309935", "1309935", "130993...
# $ volnb <chr> "1", "1", "1", "1", "4", "1", "1", "8", "1", "1", "1", "1", "13", "4", "9", "12", "3", "2", "7", "10", "1", "6", "11"...
# $ repnme <chr> "Attacking rural poverty : strategy and\n public actions", "State policies and women's autonomy in\n ...
# $ abstracts <chr> "Poverty remains pervasive, and its\n incidence and intensity are usually higher in rural than in\n ...
# $ display_title <chr> "Attacking rural poverty :\n strategy and public actions", "State policies and women's\n autono...
# $ listing_relative_url <chr> "/research/2000/12/1000476/attacking-rural-poverty-strategy-public-actions", "/research/2000/12/1000501/state-policie...
# $ docty <chr> "Newsletter", "Working Paper (Numbered Series)", "Publication", "Poverty Reduction Strategy Paper (PRSP)", "Environme...
# $ subtopic <chr> "Economic Theory & Research,Rural Settlements,Industrial Economics,Nutrition,Educational Sciences,Economic Growth,Agr...
# $ docna <chr> "Attacking rural poverty : strategy and\n public actions", "State policies and women's autonomy in\n ...
# $ teratopic <chr> "Poverty Reduction", "Education", "Energy", "Poverty Reduction", "Industry,Transport,Water Resources", NA, "Governanc...
# $ authors <chr> "Okidegbe, Nwanze", "Zhang, Xiaodan", "Bogach, V. Susan", NA, "Carl Brothers International Inc.", "World Bank", "Mann...
# $ entityids <chr> "000094946_01022305364180", "000094946_01022705322025", "000094946_01011005520622", "000094946_0102240538258", "00009...
# $ subsc <chr> "Macro/Non-Trade", "Human Development", "(Historic)Other power and energy conversion", "(Historic)Macro/non-trade", "...
# $ lang <chr> "English", "English", "English", "English", "English", "Portuguese", "English", "English", "Chinese", "English", "Eng...
# $ historic_topic <chr> "Poverty Reduction", "Education", "Energy", "Poverty Reduction", "Industry,Transport,Water Resources", NA, "Governanc...
# $ seccl <chr> "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public", "Public...
# $ sectr <chr> "(Historic)Economic Policy", "(Historic)Multisector", "(Historic)Electric Power & Other Energy", "(Historic)Economic ...
# $ majdocty <chr> "Publications & Research", "Publications & Research", "Publications,Publications & Research", "Country Focus", "Proje...
# $ src_cit <chr> "Rural development note. -- No. 6 (December 2000)", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# $ keywd <list> [[["Rural Poor;medium term expenditure\n framework;rural poverty reduction strategy;rural\n ar...
# $ owner <chr> "Environ & Soc Sustainable Dev VP (ESD)", "Off of Sr VP Dev Econ/Chief Econ (DECVP)", "Energy & Mining Sector Unit (E...
# $ repnb <chr> "21649", "21743", "WTP492", "21834", "E287", "27779", "21604", "E425", "21604", "22194", "21837", "22903", "E425", "E...

How to unnest a list containing data frames

One of the issue is that there are nested data.frame within each column

library(tidyverse)
df %>%
mutate(json = map(json, ~ if(is.null(.x))
tibble(attributes.StreetName = NA_character_, attributes.Match_addr = NA_character_)
else do.call(data.frame, c(.x, stringsAsFactors = FALSE)))) %>%
unnest
# A tibble: 5 x 7
# full_address attributes.StreetNa… attributes.Match_ad… address location.x location.y score
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 2379 ADDISON BLVD, HIGH POINT, … <NA> <NA> <NA> NA NA NA
#2 1751 W LEXINGTON AVE, HIGH POIN… <NA> <NA> <NA> NA NA NA
#3 2514 WILLARD DAIRY RD, HIGH POI… WILLARD DAIRY 2514 WILLARD DAIRY 2514 WILLARD DAI… -80.0 36.0 92.8
#4 126 MARYWOOD DR, HIGH POINT, NC… MARYWOOD 126 MARYWOOD, HIGH … 126 MARYWOOD, HI… -80.0 36.0 97.2
#5 508 EDNEY RIDGE RD, GREENSBORO,… EDNEY RIDGE 508 EDNEY RIDGE RD 508 EDNEY RIDGE … -79.8 36.1 100

Or using map_if

f1 <- function(dat) {
dat %>%
flatten

}

f2 <- function(dat) {
tibble(attributes.StreetName = NA_character_,
attributes.Match_addr = NA_character_)
}

df %>%
mutate(json = map_if(json, is.data.frame, f1, .else = f2)) %>%
unnest
# A tibble: 5 x 7
# full_address attributes.StreetNa… attributes.Match_ad… address score location.x location.y
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 2379 ADDISON BLVD, HIGH POINT, … <NA> <NA> <NA> NA NA NA
#2 1751 W LEXINGTON AVE, HIGH POIN… <NA> <NA> <NA> NA NA NA
#3 2514 WILLARD DAIRY RD, HIGH POI… WILLARD DAIRY 2514 WILLARD DAIRY 2514 WILLARD DAI… 92.8 -80.0 36.0
#4 126 MARYWOOD DR, HIGH POINT, NC… MARYWOOD 126 MARYWOOD, HIGH … 126 MARYWOOD, HI… 97.2 -80.0 36.0
#5 508 EDNEY RIDGE RD, GREENSBORO,… EDNEY RIDGE 508 EDNEY RIDGE RD 508 EDNEY RIDGE … 100 -79.8 36.1

How to turn a list of lists into columns of a pandas dataframe?

You can try using df.explode and df.apply:

import pandas as pd

df = pd.DataFrame(data= {'Generation': 0, 'Route_set':[[[20., 19., 47., 56.], [21., 34., 78., 34.]]]})
df['route1']=df['Route_set'].apply(lambda x: x[0])
df['route2']=df['Route_set'].apply(lambda x: x[1])
df = df.explode(['route1', 'route2'], ignore_index=True)
df2 = df[df.columns.difference(['Route_set', 'Generation'])]
|    |   route1 |   route2 |
|---:|---------:|---------:|
| 0 | 20 | 21 |
| 1 | 19 | 34 |
| 2 | 47 | 78 |
| 3 | 56 | 34 |

Or you can just create a new dataframe with the values like this:

import pandas as pd

df = pd.DataFrame(data= {'Generation': 0, 'Route_set':[[[20., 19., 47., 56.], [21., 34., 78., 34.]]]})
df1 = pd.DataFrame.from_dict(dict(zip(['route1', 'route2'], df.Route_set.to_numpy()[0])), orient='index').transpose()
|    |   route1 |   route2 |
|---:|---------:|---------:|
| 0 | 20 | 21 |
| 1 | 19 | 34 |
| 2 | 47 | 78 |
| 3 | 56 | 34 |

Update 1:

import pandas as pd

df = pd.DataFrame(data= {'Generation': 0, 'Route_set':[
[[20.0, 19.0, 47.0, 56.0, 43.0, 53.0, 18.0, -1.0, -1.0, -1.0, -1.0, -1.0], [20.0, 51.0, 46.0, 37.0, 2.0, 57.0, 49.0, 36.0, 25.0, 5.0, 4.0, 34.0], [54.0, 23.0, 5.0, 46.0, 34.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [57.0, 48.0, 46.0, 35.0, 25.0, 27.0, 52.0, 8.0, 39.0, 22.0, 51.0, 28.0], [57.0, 16.0, 45.0, 25.0, 49.0, 38.0, 0.0, 46.0, 13.0, 18.0, 19.0, 20.0], [21.0, 11.0, 6.0, 33.0, 25.0, 49.0, 57.0, 29.0, 12.0, 3.0, -1.0, -1.0], [9.0, 15.0, 47.0, 42.0, 25.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [51.0, 25.0, 22.0, 14.0, 39.0, 8.0, 40.0, 0.0, 10.0, 26.0, 32.0, 47.0], [1.0, 33.0, 24.0, 46.0, 56.0, 30.0, 48.0, 51.0, -1.0, -1.0, -1.0, -1.0], [25.0, 31.0, 50.0, 17.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [57.0, 12.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [20.0, 41.0, 47.0, 15.0, 46.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [14.0, 44.0, 39.0, 25.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [20.0, 51.0, 25.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [57.0, 49.0, 5.0, 20.0, 37.0, 46.0, 36.0, 25.0, 39.0, 51.0, 48.0, -1.0], [5.0, 0.0, 33.0, 55.0, 25.0, 48.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [51.0, 32.0, 33.0, 24.0, 35.0, 8.0, 25.0, 4.0, 46.0, 1.0, 7.0, -1.0], [5.0, 25.0, 34.0, 46.0, 1.0, 9.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [38.0, 57.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0], [12.0, 57.0, 49.0, 25.0, 9.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0]],
]})

data = df.Route_set.to_numpy()[0]

df = pd.DataFrame.from_dict(dict(zip(['route{}'.format(i) for i in range(1, len(data)+1)], [data[i] for i in range(len(data))])), orient='index').transpose()
df = df.apply(lambda x: x.explode() if 'route' in x.name else x)

df[sorted(df.columns)]
print(df.to_markdown())
|    |   route1 |   route2 |   route3 |   route4 |   route5 |   route6 |   route7 |   route8 |   route9 |   route10 |   route11 |   route12 |   route13 |   route14 |   route15 |   route16 |   route17 |   route18 |   route19 |   route20 |
|---:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|
| 0 | 20 | 20 | 54 | 57 | 57 | 21 | 9 | 51 | 1 | 25 | 57 | 20 | 14 | 20 | 57 | 5 | 51 | 5 | 38 | 12 |
| 1 | 19 | 51 | 23 | 48 | 16 | 11 | 15 | 25 | 33 | 31 | 12 | 41 | 44 | 51 | 49 | 0 | 32 | 25 | 57 | 57 |
| 2 | 47 | 46 | 5 | 46 | 45 | 6 | 47 | 22 | 24 | 50 | -1 | 47 | 39 | 25 | 5 | 33 | 33 | 34 | -1 | 49 |
| 3 | 56 | 37 | 46 | 35 | 25 | 33 | 42 | 14 | 46 | 17 | -1 | 15 | 25 | -1 | 20 | 55 | 24 | 46 | -1 | 25 |
| 4 | 43 | 2 | 34 | 25 | 49 | 25 | 25 | 39 | 56 | -1 | -1 | 46 | -1 | -1 | 37 | 25 | 35 | 1 | -1 | 9 |
| 5 | 53 | 57 | -1 | 27 | 38 | 49 | -1 | 8 | 30 | -1 | -1 | -1 | -1 | -1 | 46 | 48 | 8 | 9 | -1 | -1 |
| 6 | 18 | 49 | -1 | 52 | 0 | 57 | -1 | 40 | 48 | -1 | -1 | -1 | -1 | -1 | 36 | -1 | 25 | -1 | -1 | -1 |
| 7 | -1 | 36 | -1 | 8 | 46 | 29 | -1 | 0 | 51 | -1 | -1 | -1 | -1 | -1 | 25 | -1 | 4 | -1 | -1 | -1 |
| 8 | -1 | 25 | -1 | 39 | 13 | 12 | -1 | 10 | -1 | -1 | -1 | -1 | -1 | -1 | 39 | -1 | 46 | -1 | -1 | -1 |
| 9 | -1 | 5 | -1 | 22 | 18 | 3 | -1 | 26 | -1 | -1 | -1 | -1 | -1 | -1 | 51 | -1 | 1 | -1 | -1 | -1 |
| 10 | -1 | 4 | -1 | 51 | 19 | -1 | -1 | 32 | -1 | -1 | -1 | -1 | -1 | -1 | 48 | -1 | 7 | -1 | -1 | -1 |
| 11 | -1 | 34 | -1 | 28 | 20 | -1 | -1 | 47 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |

How to unnest column-list?

You can do this by coercing the elements in the list column to data frames arranged as you like, which will unnest nicely:

library(tidyverse)

tibble(a = c('first', 'second'),
b = list(c('colA' = 1, 'colC' = 2), c('colA'= 3, 'colB'=2))) %>%
mutate(b = invoke_map(tibble, b)) %>%
unnest()
#> # A tibble: 2 x 4
#> a colA colC colB
#> <chr> <dbl> <dbl> <dbl>
#> 1 first 1. 2. NA
#> 2 second 3. NA 2.

Doing the coercion is a little tricky, though, as you don't want to end up with a 2x1 data frame. There are various ways around this, but a direct route is purrr::invoke_map, which calls a function with purrr::invoke (like do.call) on each element in a list.

Unnest one of several list columns in dataframe

According to unnest, the argument ...

Specification of columns to nest. Use bare variable names or
functions of variables. If omitted, defaults to all list-cols.

Therefore, we could specify the column name to be unnested after the rename_all

iris %>
... #op's code
...

rename_all(funs(str_c("Mean.", .))))) %>%
unnest(sum_data)
# A tibble: 3 x 6
# Species data Mean.Sepal.Length Mean.Sepal.Width Mean.Petal.Length Mean.Petal.Width
# <fctr> <list> <dbl> <dbl> <dbl> <dbl>
#1 setosa <tibble [50 x 4]> 5.01 3.43 1.46 0.246
#2 versicolor <tibble [50 x 4]> 5.94 2.77 4.26 1.33
#3 virginica <tibble [50 x 4]> 6.59 2.97 5.55 2.03


Related Topics



Leave a reply



Submit