R: How to Expand a Row Containing a "List" to Several Rows...One for Each List Member

I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should ideally provide!)

#  Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )

Now we use the data.table package to do the reshaping in a couple of lines...

#  Load data.table package
require(data.table)

# Turn data.frame into data.table, which looks like..
dt <- data.table(df)
# pep pro
#1: 266;372;572 1
#2: 908;202;896 3
#3: 944;660;628 2

# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
# pro pep
#1: 1 266
#2: 1 372
#3: 1 572
#4: 3 908
#5: 3 202
#6: 3 896
#7: 2 944
#8: 2 660
#9: 2 628
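
For comparison, a tidyr sketch that does the same split on the sample df above:

#  Equivalent tidyverse approach: split the ";"-separated pep values into one row each
library(tidyr)
separate_rows(df, pep, sep = ";")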

Expand each row in R dataframe with multiple rows

First, some working (but not very good) code:

require(tidyverse)
fnames <- list.files(path = '.', pattern = '*.foo', recursive = TRUE)
out_df <-
  fnames %>%
  map(~ readLines(.x)) %>%   # read each file into a character vector of lines
  setNames(fnames) %>%
  t %>%
  as.data.frame %>%
  gather(fname, lines) %>%
  unnest(lines)

out_df

This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:

contents of f1.foo

line_1_f1
line_2_f1

contents of f2.foo

line_1_f2
line_2_f2
line_3_f2

Changes relative to your approach:

  1. Avoid using the built-in function file() as a column name. I used fname instead.
  2. Don't use system() to read the files; there are built-in R functions for that. Using system() needlessly makes porting your code to other operating systems far less likely to succeed.
  3. Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation works in dplyr, it's hard to use readLines(...) inside a mutate() where the file connection to be read varies.
  4. Use purrr::map() to generate a list of lists of file content lines from a list of filenames. This is a tidyverse way of writing a for-loop.
  5. Set the names of the list elements with setNames().
  6. Munge this list into a data.frame using t() and as.data.frame().
  7. Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
  8. Expand the list column into one row per line with unnest().

I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.

fnames <- list.files(path='.', pattern='*.foo', recursive=TRUE)

out_df <- data.frame(fname = character(), lines = character())
for (fname in fnames) {
  fcontents <- readLines(fname)
  this_df <- data.frame(fname = fname, lines = fcontents)
  out_df <- bind_rows(out_df, this_df)
}

The output in either case is

   fname     lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
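
If you want to avoid both the t()/gather() gymnastics and the explicit loop, a purrr/tidyr sketch (same *.foo sample files assumed) gives the same out_df:

library(tidyverse)
fnames <- list.files(path = '.', pattern = '*.foo', recursive = TRUE)
out_df <- tibble(fname = fnames) %>%
  mutate(lines = map(fname, readLines)) %>%  # one character vector of lines per file
  unnest(lines)                              # one row per line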

Expand nested dataframe cell in long format

You can use unnest() in tidyr to expand a nested column.

tidyr::unnest(df, part_list)

# # A tibble: 3 x 2
# chapterid part_list
# <chr> <chr>
# 1 a c
# 2 a d
# 3 b e

Data

df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")

# chapterid part_list
# 1 a c, d
# 2 b e

How to expand a data.frame containing a column with a list

You should use unnest from tidyr (loading the whole tidyverse works too).

library(tidyverse) # or just library(tidyr), whichever suits you
d %>%
  unnest(children) %>%
  mutate(id = row_number()) # You may not want to run this if you want to keep your original id

Or, the same in base R:

d_new <- do.call('rbind', do.call('Map', c(data.frame, d)))

Again, to reset the id we number the rows of the expanded data:

d_new$id <- seq_len(nrow(d_new)) # This may not be required if you don't want to reset your id
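
To see what the base R one-liner does, here is a toy d (hypothetical, since the original data isn't shown) with an id column and a children list column:

# Hypothetical toy data: one id per row, a list of children per row
d <- data.frame(id = 1:2)
d$children <- list(c("a", "b"), "c")

# Map(data.frame, id = ..., children = ...) builds one small data.frame per row,
# recycling id across its children; rbind then stacks them into long format
do.call('rbind', do.call('Map', c(data.frame, d)))
# Result: three rows, id 1 paired with "a" and "b", id 2 paired with "c"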

Thanks for all the comments; they are all welcome, and yes, I am overwhelmed.
Thanks to @Ronak Shah and @r2evans.

Expanding a matrix to include rows for each element in an interval

We may use map2 to get the sequence between the two year columns as a list column and then unnest it.

library(dplyr)
library(purrr)
library(tidyr)
a %>%
  transmute(country, incident, year = map2(start.year, end.year, `:`)) %>%
  unnest(year)

Output:

# A tibble: 7 × 3
country incident year
<chr> <chr> <int>
1 AAA disaster 1990
2 AAA disaster 1991
3 AAA disaster 1992
4 AAA disaster 1993
5 BBB disaster 1995
6 CCC disaster 2011
7 CCC disaster 2012

If the 'country' column values are unique, either use a group_by()/summarise() or use rowwise() to expand as well:

a %>%
  group_by(country) %>%
  summarise(incident, year = start.year:end.year, .groups = 'drop')
# A tibble: 7 × 3
country incident year
<chr> <chr> <int>
1 AAA disaster 1990
2 AAA disaster 1991
3 AAA disaster 1992
4 AAA disaster 1993
5 BBB disaster 1995
6 CCC disaster 2011
7 CCC disaster 2012

Or use uncount to expand the data

a %>%
  uncount(end.year - start.year + 1) %>%
  group_by(country) %>%
  mutate(year = start.year + row_number() - 1, .keep = 'unused',
         end.year = NULL) %>%
  ungroup
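
For reference, an input tibble consistent with the outputs above would look like this (reconstructed, since the question's data isn't shown):

library(tibble)
a <- tibble(
  country = c("AAA", "BBB", "CCC"),
  incident = "disaster",
  start.year = c(1990L, 1995L, 2011L),
  end.year = c(1993L, 1995L, 2012L)
)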

Extend/expand data frame with column of lists each into a row

We can remove the first column (df[-1]), loop over the other columns, unlist each, and then convert the resulting list to a data.frame:

lst <- lapply(df[-1], unlist)
dfN <- data.frame(lst)
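
For example, with a hypothetical df whose first column is an id and whose remaining columns are list columns of equal total length:

# Hypothetical toy data: an id column plus two list columns
df <- data.frame(id = 1:2)
df$x <- list(1:2, 3:4)
df$y <- list(c("a", "b"), c("c", "d"))

lst <- lapply(df[-1], unlist)  # drop the id column and flatten each list column
dfN <- data.frame(lst)         # 4 rows, columns x and y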

Expand multiple columns of data.table containing list observations

One way to think about the problem is to process the list columns with lapply(), expanding each one separately into a list of data.tables, and then merge everything in that list at once.

To create the list of expanded variables you would just do the following:

    expandcols <- c("origins", "destinations")

    lapply(expandcols, function(i) rbindlist(dt[[i]], idcol = "r"))

Also note that your original r column is a character vector and the idcol created by rbindlist is an integer so you will need consistency here. In my code I just converted your original to numeric.

To merge a list of data.tables I like to use the Reduce function like this:

     Reduce(function(...) merge(...,by="keys"), list())

The output will be one data.table where your key column is "r" and the list will be the result of the lapply call above. You can then merge the result back onto your original data.table the data.table way. Putting it all together, the call looks like this:

    dtfinal <- Reduce(function(...) merge(..., by = "r"),
                      lapply(expandcols, function(i) rbindlist(dt[[i]], idcol = "r"))
                      )[dt[, -expandcols, with = F], on = "r"]
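
As a self-contained illustration of that one-liner, here is a toy dt (the data and the text/value field names are assumptions, since the original data isn't shown):

    library(data.table)
    ## Toy data: integer key 'r' (matching rbindlist's idcol), a regular column, and two
    ## list columns whose elements are one-row data.tables with 'text' and 'value' fields
    dt <- data.table(
      r = 1:2,
      other = c("a", "b"),
      origins = list(data.table(text = "o1", value = 1), data.table(text = "o2", value = 2)),
      destinations = list(data.table(text = "d1", value = 10), data.table(text = "d2", value = 20))
    )
    expandcols <- c("origins", "destinations")
    dtfinal <- Reduce(function(...) merge(..., by = "r"),
                      lapply(expandcols, function(i) rbindlist(dt[[i]], idcol = "r"))
                      )[dt[, -expandcols, with = FALSE], on = "r"]
    dtfinal  ## one row per r; the duplicated text/value names get .x/.y suffixes from merge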

Here is the code for the function I made:

    list_expander_fn <- function(X){
      '%notin%' <- Negate('%in%')  ## Helpful for selecting column names later

      ## Worker function, called repeatedly as needed; takes a data.table as its only argument
      expandcols_fun <- function(Y){
        listcols <- colnames(Y)[which(sapply(Y, is.list))]  # Identify list columns
        listdt <- lapply(listcols, function(i)
          tryCatch(rbindlist(Y[[i]], idcol = "r"), error = function(e) NULL))  # Expand each list with rbindlist; return NULL on error

        invalidlists <- which(sapply(listdt, is.null))  # rbindlist only works when list elements contain data.tables

        ## Simply unlist when a plain character vector was stored, as in the destination and origin address columns
        if(length(invalidlists) != 0){
          Y[, listcols[invalidlists] := lapply(.SD, unlist), .SDcols = listcols[invalidlists]]

          listcols <- listcols[-invalidlists]  ## Update list columns to be merged
          listdt <- listdt[-invalidlists]      ## Remove NULL elements from listdt
        }

        origcols <- colnames(Y)[colnames(Y) %notin% listcols]  ## Identify non-list columns for the final merge
        currentdt <- Reduce(function(...) merge(..., by = "r"), listdt)  ## Merge the list of data.tables
        return(currentdt[Y[, origcols, with = F], on = "r"])
      }

      repeat{
        currentexpand <- expandcols_fun(X)  # Run one round of expansion
        listcheck <- sapply(currentexpand, is.list)  # Check whether list columns still exist
        if(sum(listcheck) != 0){
          X <- currentexpand  # Update X for the next pass
        } else {
          break
        }
      }

      return(currentexpand)
    }

It works, but there are issues with variable names because of the final field names (text and value). I could probably tinker with that a bit if you like where this is going. It works on 'rows2' but not 'rows'. The code to call it is, of course, simple:

    finaldt <- list_expander_fn(dt)

Does that help answer your question? Let me know if you want me to add anything to the explanation. Good luck!

Pandas column of lists, create a row for each list element

UPDATE: the solution below was helpful for older Pandas versions, because DataFrame.explode() wasn't available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().



import numpy as np
import pandas as pd

lst_col = 'samples'

r = pd.DataFrame({
        col: np.repeat(df[col].values, df[lst_col].str.len())
        for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]

Result:

In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3

PS: here you may find a slightly more generic solution


UPDATE: some explanations. IMO the easiest way to understand this code is to execute it step by step:

in the following line we are repeating values in one column N times, where N is the length of the corresponding list:

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

this can be generalized for all columns containing scalar values:

In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2

[18 rows x 2 columns]

using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])

putting all this together:

In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31

[18 rows x 3 columns]

appending [df.columns] to the pd.DataFrame(...) call guarantees that we select the columns in the original order...

expand a data frame to have as many rows as range of two columns in original row

With dplyr, we can use rowwise() with do():

library(dplyr)
df1 %>%
  rowwise() %>%
  do(data.frame(symbol = .$symbol, value = .$start:.$end)) %>%
  arrange(symbol)
# A tibble: 30 x 2
# symbol value
# <chr> <int>
# 1 a 7
# 2 a 8
# 3 a 9
# 4 a 10
# 5 a 11
# 6 i 8
# 7 i 9
# 8 i 10
# 9 i 11
#10 i 12
# ... with 20 more rows
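
Since do() is superseded in current dplyr, here is a sketch of the same expansion with tidyr (assuming the same df1 with symbol, start, and end columns):

library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  mutate(value = map2(start, end, seq)) %>%  # one integer sequence per row
  select(symbol, value) %>%
  unnest(value) %>%
  arrange(symbol)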

