R: how to expand a row containing a list to several rows...one for each list member?
I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should ideally provide!)
# Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )
Now we use the data.table package to do the reshaping in a couple of lines...
# Load data.table package
require(data.table)
# Turn data.frame into data.table, which looks like..
dt <- data.table(df)
# pep pro
#1: 266;372;572 1
#2: 908;202;896 3
#3: 944;660;628 2
# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
# pro pep
#1: 1 266
#2: 1 372
#3: 1 572
#4: 3 908
#5: 3 202
#6: 3 896
#7: 2 944
#8: 2 660
#9: 2 628
Expand each row in R dataframe with multiple rows
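First, some working (but not very good) code:
require(tidyverse)
# Find the input files; pattern is a regular expression, not a glob
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <-
fnames %>%
map(readLines) %>% # readLines() takes a path directly, so no connection is left open
setNames(fnames) %>%
t %>%
as.data.frame %>%
gather(fname, lines) %>%
unnest(lines)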
out_df
This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:
contents of f1.foo
line_1_f1
line_2_f1
contents of f2.foo
line_1_f2
line_2_f2
line_3_f2
Changes relative to your approach:
- Avoid using the name of the built-in function file() as a column name. I used fname instead.
- Don't use system to read the files; there are built-in R functions to do that. Using system() needlessly makes porting your code to other operating systems far less likely to succeed.
- Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation with dplyr works, it's hard to use readLines(...) inside of a mutate() where the file connection to be read varies.
- Use purrr::map() to generate a list of lists of file content lines from a list of filenames. This is a tidyverse way of writing a for-loop.
- Set the names of the list elements with setNames().
- Munge this list into a data.frame using t() and as.data.frame().
- Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
- Expand the list using unnest().
I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <- data.frame(fname = c(), lines=c())
for(fname in fnames){
  # readLines() accepts a path and already returns a character vector
  fcontents <- readLines(fname)
  this_df <- data.frame(fname = fname, lines = fcontents, stringsAsFactors = FALSE)
  out_df <- bind_rows(out_df, this_df)
}
The output in either case is
fname lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
Expand nested dataframe cell in long format
You can use unnest() in tidyr to expand a nested column.
tidyr::unnest(df, part_list)
# # A tibble: 3 x 2
# chapterid part_list
# <chr> <chr>
# 1 a c
# 2 a d
# 3 b e
Data
df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")
# chapterid part_list
# 1 a c, d
# 2 b e
How to expand a data.frame containing a column with a list
You should use unnest from tidyr, or you can load the whole tidyverse (sorry about earlier, I load tidyverse from my .Rprofile file). Anyway:
library(tidyverse) # or library(tidyr) and library(dplyr), whichever suits you
d %>%
unnest(children) %>%
mutate(id = row_number()) # Skip this step if you want to keep your original id
Also, the base R way:
d_new <- do.call('rbind', do.call('Map', c(data.frame, d)))
Again, to reset id, we have to use the row count of the expanded data:
d_new$id <- seq_len(nrow(d_new)) # This may not be required if you don't want to reset your id
Thanks for all the comments, they are all welcome, and yes, I am overwhelmed.
Thanks to @Ronak Shah, @r2evans
Expanding a matrix to include rows for each element in an interval
We may use map2 to get the sequence between the two columns as a list and then unnest the list column
library(dplyr)
library(purrr)
library(tidyr)
a %>%
transmute(country, incident, year = map2(start.year, end.year, `:`)) %>%
unnest(year)
-output
# A tibble: 7 × 3
country incident year
<chr> <chr> <int>
1 AAA disaster 1990
2 AAA disaster 1991
3 AAA disaster 1992
4 AAA disaster 1993
5 BBB disaster 1995
6 CCC disaster 2011
7 CCC disaster 2012
If the 'country' column is unique, either use a group by/summarise or use rowwise to expand as well
a %>%
group_by(country) %>%
summarise(incident, year = start.year:end.year, .groups = 'drop')
# A tibble: 7 × 3
country incident year
<chr> <chr> <int>
1 AAA disaster 1990
2 AAA disaster 1991
3 AAA disaster 1992
4 AAA disaster 1993
5 BBB disaster 1995
6 CCC disaster 2011
7 CCC disaster 2012
Or use uncount to expand the data
a %>%
uncount(end.year - start.year + 1) %>%
group_by(country) %>%
mutate(year = start.year + row_number() - 1, .keep = 'unused',
end.year = NULL) %>%
ungroup
Extend/expand data frame with column of lists each into a row
We can remove the first column (df[-1]), loop over the other columns, unlist and then convert the list to data.frame
lst <- lapply(df[-1], unlist)
dfN <- data.frame(lst)
Expand multiple columns of data.table containing list observations
So one way to think about the problem is to process the list columns using an lapply to expand each separately and store into a list of data.tables and then merge all of those in the list at once.
To create the list of expanded variables you would just do the following:
expandcols<-c("origins","destinations")
lapply(expandcols,function(i) rbindlist(dt[[i]],idcol = "r"))
Also note that your original r column is a character vector and the idcol created by rbindlist is an integer so you will need consistency here. In my code I just converted your original to numeric.
To merge a list of data.tables I like to use the Reduce function like this:
Reduce(function(...) merge(...,by="keys"), list())
The output will be one data.table where your key column is "r" and the list will be the result of the lapply call above. You can then merge the result with your original data.table using a join on "r". Putting it all together, the call would look like this:
dtfinal<-Reduce(function(...) merge(...,by="r"),lapply(expandcols,function(i) rbindlist(dt[[i]],idcol = "r")))[dt[,-expandcols,with=F],on="r"]
Here is the code for the function I made:
list_expander_fn<-function(X){
  '%notin%'<-Negate('%in%')##Helpful for selecting column names later
  expandcols_fun<-function(Y){##Main function to be called recursively as needed and takes in a data.table object as its only argument.
    listcols<-colnames(Y)[which(sapply(Y,is.list))] #Identify list columns
    listdt<-lapply(listcols,function(i) tryCatch(rbindlist(Y[[i]],idcol = "r"),error=function(e) NULL)) #Expand lists using rbindlist and returns null on error.
    invalidlists<-which(sapply(listdt,is.null)) #Rbindlist does not work unless list elements contain data.tables
    ##Simply unlists if character vector is created like in destination and origin addresses columns
    if(length(invalidlists)!=0){
      Y[,listcols[invalidlists]:=lapply(.SD,unlist),.SDcols = listcols[invalidlists]]
      listcols<-listcols[-invalidlists] ##Update list columns to be merged
      listdt<-listdt[-invalidlists]##Removes NULL elements from the listdt.
    }
    origcols<-colnames(Y)[colnames(Y)%notin%listcols]##Identifies nonlist columns for final merge
    currentdt<-Reduce(function(...) merge(...,by="r"),listdt) ##merges list of data.tables
    return(currentdt[Y[,origcols,with=F],on="r"])
  }
  repeat{
    currentexpand<-expandcols_fun(X) #Executes the expandcols_fun
    listcheck<-sapply(currentexpand,is.list) #Checks again if lists still exist
    if(sum(listcheck)!=0){
      X<-currentexpand #Updates the X for recursive calls
    } else{
      break
    }
  }
  return(currentexpand)
}
It works but there are issues with variable names because of the final field names (text and value). I could probably tinker with that a bit if you like where this is going. It works on 'rows2' but not 'rows'. The code to call it will be of course simple:
finaldt<-list_expander_fn(dt)
Does that help answer your question? Let me know if you want me to add anything to the explanation. Good luck!
Pandas column of lists, create a row for each list element
UPDATE: the solution below was helpful for older Pandas versions, because DataFrame.explode() wasn't available yet. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
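For readers on Pandas 0.25.0 or later, here is a minimal sketch of the DataFrame.explode() route. The small data frame is made up for illustration (it is not the data from the question); only the column names samples, subject and trial_num follow the output shown in this answer.

```python
import pandas as pd

# Hypothetical sample data: one list of samples per (subject, trial_num) row
df = pd.DataFrame({
    "subject": [1, 1, 2],
    "trial_num": [1, 2, 1],
    "samples": [[0.10, -0.20], [0.25], [-0.03, 0.76]],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index() replaces the duplicated original index with 0..n-1
exploded = df.explode("samples").reset_index(drop=True)

# The exploded column keeps dtype object, so convert it back to float
exploded["samples"] = exploded["samples"].astype(float)

print(exploded)
```

Note that explode() keeps the original row index by default, which is why the sketch resets it.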
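import numpy as np
import pandas as pd

# df is the data frame from the question; 'samples' is its list column
lst_col = 'samples'
r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)}
).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]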
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
PS here you may find a bit more generic solution
UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:
in the following line we are repeating values in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized for all columns, containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order...
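To try the steps above end to end, here is a self-contained sketch of the same np.repeat / np.concatenate approach. The tiny data frame is an assumption made up for illustration, and the intermediate name lengths is introduced only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with lists of different lengths
df = pd.DataFrame({
    "trial_num": [1, 2],
    "subject": [1, 1],
    "samples": [[0.1, -0.2, 0.05], [0.25, 1.32]],
})

lst_col = "samples"

# Number of elements in each list, i.e. how often each row is repeated
lengths = df[lst_col].str.len()

r = pd.DataFrame({
    # Repeat every scalar column once per list element
    col: np.repeat(df[col].values, lengths)
    for col in df.columns.drop(lst_col)
}).assign(
    # Flatten all lists into one 1D vector and attach it as the list column
    **{lst_col: np.concatenate(df[lst_col].values)}
)[df.columns]  # restore the original column order

print(r)
```

One behavioural difference worth knowing: a row whose list is empty is repeated zero times here and so disappears from the result, whereas DataFrame.explode() keeps such a row with NaN.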
expand a data frame to have as many rows as range of two columns in original row
With dplyr, we can use rowwise with do
library(dplyr)
df1 %>%
rowwise() %>%
do(data.frame(symbol= .$symbol, value = .$start:.$end)) %>%
arrange(symbol)
# A tibble: 30 x 2
# symbol value
# <chr> <int>
# 1 a 7
# 2 a 8
# 3 a 9
# 4 a 10
# 5 a 11
# 6 i 8
# 7 i 9
# 8 i 10
# 9 i 11
#10 i 12
# ... with 20 more rows