R Subsetting a Data Frame into Multiple Data Frames Based on Multiple Column Values

You are looking for split:

split(df, with(df, interaction(v1,v2)), drop = TRUE)
$E.X
v1 v2 v3 v4 v5
3 E X 2 12 15
5 E X 2 14 16

$D.Y
v1 v2 v3 v4 v5
2 D Y 10 12 8

$A.Z
v1 v2 v3 v4 v5
1 A Z 1 10 12
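
For readers who want to run the call end to end, here is a minimal sketch. The data frame below is hypothetical (its values are illustrative and are not the asker's data); only split, interaction, and the column names v1 and v2 come from the answer above.

```r
# Hypothetical data for illustration only
df <- data.frame(
  v1 = c("A", "D", "E", "A", "E"),
  v2 = c("Z", "Y", "X", "Z", "X"),
  v3 = c(1, 10, 2, 5, 2)
)

# One data frame per observed (v1, v2) combination;
# drop = TRUE discards combinations that never occur (e.g. A.X)
parts <- split(df, interaction(df$v1, df$v2), drop = TRUE)

names(parts)        # group labels built as "<v1>.<v2>"
sapply(parts, nrow) # number of rows in each piece
```

Without drop = TRUE the result would also contain an empty data frame for every unused combination of levels.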

As noted in the comments, any of the following would work:

library(microbenchmark)
microbenchmark(
split(df, list(df$v1,df$v2), drop = TRUE),
split(df, interaction(df$v1,df$v2), drop = TRUE),
split(df, with(df, interaction(v1,v2)), drop = TRUE))

Unit: microseconds
                                                  expr      min        lq    median       uq      max neval
            split(df, list(df$v1, df$v2), drop = TRUE) 1119.845 1129.3750 1145.8815 1182.119 3910.249   100
     split(df, interaction(df$v1, df$v2), drop = TRUE)  893.749  900.5720  909.8035  936.414 3617.038   100
 split(df, with(df, interaction(v1, v2)), drop = TRUE)  895.150  902.5705  909.8505  927.128 1399.284   100

It appears that interaction is slightly faster (probably because f = list(...) is simply converted to an interaction within the function).
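
A quick way to check that the list form and the interaction form group the rows the same way is to compare the two results directly. This is only a sketch on hypothetical data; by_list and by_interaction are names introduced here for illustration.

```r
# Hypothetical data for illustration only
df <- data.frame(
  v1 = c("A", "D", "E", "E"),
  v2 = c("Z", "Y", "X", "X"),
  v4 = c(10, 12, 12, 14)
)

by_list        <- split(df, list(df$v1, df$v2), drop = TRUE)
by_interaction <- split(df, interaction(df$v1, df$v2), drop = TRUE)

# Same group labels, and the same original rows in each group
same_groups <- identical(names(by_list), names(by_interaction))
same_rows   <- identical(lapply(by_list, rownames),
                         lapply(by_interaction, rownames))
```

If both flags come back TRUE, the choice between the two forms is a matter of speed and readability rather than of results.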


Edit

If you just want to use the subset data.frames, then I would suggest using data.table for ease of coding:

library(data.table)

dt <- data.table(df)
dt[, plot(v4, v5), by = list(v1, v2)]

Subset dataframe into multiple based on multiple conditions

Remove the numbers from the MS column and pass the result to split, which splits the data frame into a list of data frames based on the remaining pattern.

result <- split(D_MtC, sub('\\d+', '', D_MtC$MS))

where output from sub is :

sub('\\d+', '', D_MtC$MS)

#[1] "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl"
# "bl" "bu" "bu" "bu" "bu" "bu" "bu" "bu" "bu"
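
A self-contained sketch of the same idea follows. The contents of D_MtC are hypothetical (the original data is not shown); the value column and the key variable are introduced here only for illustration.

```r
# Hypothetical stand-in for D_MtC: MS holds a letter prefix plus digits
D_MtC <- data.frame(
  MS    = c("bl1", "bl2", "bu1", "bu12"),
  value = 1:4
)

# Strip the first run of digits so only the prefix remains
key <- sub("\\d+", "", D_MtC$MS)

# Split on the prefix: one data frame per distinct key
result <- split(D_MtC, key)
names(result)
```

Note that sub removes only the first run of digits in each string; if the digits can appear in several places, gsub with the same pattern removes all of them.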

How to subset multiple data frames by a variable?

We can wrap subset with lapply:

imps1 <- lapply(imps, subset, subset = gender == 1)
imps0 <- lapply(imps, subset, subset = gender == 0)

Or using tidyverse

library(dplyr)
library(purrr)
imps1 <- map(imps, ~ .x %>%
filter(gender == 1))
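
As a base R variation, the same filtering can be written with explicit logical indexing, which avoids the non-standard evaluation that subset relies on. The sketch below assumes imps is a list of data frames that each have a gender column; the two data frames and the score column are hypothetical.

```r
# Hypothetical list of data frames, each with a gender column
imps <- list(
  data.frame(gender = c(0, 1, 1), score = c(10, 20, 30)),
  data.frame(gender = c(1, 0, 0), score = c(5, 15, 25))
)

# Keep the rows for each gender in every data frame of the list
imps1 <- lapply(imps, function(d) d[d$gender == 1, , drop = FALSE])
imps0 <- lapply(imps, function(d) d[d$gender == 0, , drop = FALSE])

sapply(imps1, nrow) # rows kept per data frame
```

One difference to keep in mind: if gender contains NA, logical indexing returns all-NA rows for those entries, whereas subset and filter drop them.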

How to subset/split a dataframe of multiple columns by common number of values available in R

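This should work for what you are doing. It produces a list of data frames that you can index into one at a time:

# Count the non-missing values in every column except the first (DATE)
counts <- sapply(df[, -1, drop = FALSE], function(col) sum(!is.na(col)))
# Group column positions by count; lapply always returns a list (sapply may simplify)
groups <- lapply(unique(counts), function(n) which(counts == n))
# Build one data frame per group, keeping DATE (column 1) in each
dfList <- lapply(groups, function(idx) df[, c(1, idx + 1), drop = FALSE])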

Output is as follows:

dfList
[[1]]
DATE A D E F
1 31/12/1999 79.5 36.7 3 6
2 03/01/2000 79.5 36.7 3 6
3 04/01/2000 79.5 36.7 3 6
4 05/01/2000 79.5 38.8 3 6
5 06/01/2000 79.5 20.3 3 6
6 07/01/2000 79.5 15.6 3 6
7 10/01/2000 79.5 5.4 3 6
8 11/01/2000 79.5 15.0 3 6
9 12/01/2000 79.5 9.3 3 6
10 13/01/2000 79.5 29.1 3 6

[[2]]
DATE B
1 31/12/1999 NA
2 03/01/2000 NA
3 04/01/2000 NA
4 05/01/2000 NA
5 06/01/2000 NA
6 07/01/2000 NA
7 10/01/2000 7
8 11/01/2000 7
9 12/01/2000 7
10 13/01/2000 7

[[3]]
DATE C G H
1 31/12/1999 NA NA NA
2 03/01/2000 NA NA NA
3 04/01/2000 325.0 961 3081.9
4 05/01/2000 322.5 945 2524.7
5 06/01/2000 327.5 952 3272.3
6 07/01/2000 327.5 941 2102.9
7 10/01/2000 327.5 946 2901.5
8 11/01/2000 327.5 888 9442.5
9 12/01/2000 331.5 870 7865.8
10 13/01/2000 334.0 853 7742.1
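
The grouping step can also be expressed with split itself, by splitting the column names on their non-NA counts. This is an alternative sketch rather than the answer's code, and the small data frame is hypothetical.

```r
# Hypothetical data: DATE plus columns with differing numbers of NAs
df <- data.frame(
  DATE = c("d1", "d2", "d3", "d4"),
  A = c(1, 2, 3, 4),
  B = c(NA, NA, 5, 6),
  C = c(NA, 7, 8, 9),
  D = c(9, 8, 7, 6)
)

# Number of non-missing values per column, excluding DATE
counts <- colSums(!is.na(df[-1]))

# Column names grouped by their count, then one data frame per group
groups <- split(names(counts), counts)
dfList <- lapply(groups, function(cols) df[c("DATE", cols)])

names(dfList) # list elements are named after the counts
```

Here the list is ordered by increasing count and named after the counts, whereas the approach above orders the groups by first appearance and leaves them unnamed.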

To retrieve only complete cases from each of the data frames in the data frame list above, you can do:

dfList <- lapply(dfList, function(x) x[complete.cases(x), ])

Resulting output will be the following list of the three data frames in this example:

[[1]]
DATE A D E F
1 31/12/1999 79.5 36.7 3 6
2 03/01/2000 79.5 36.7 3 6
3 04/01/2000 79.5 36.7 3 6
4 05/01/2000 79.5 38.8 3 6
5 06/01/2000 79.5 20.3 3 6
6 07/01/2000 79.5 15.6 3 6
7 10/01/2000 79.5 5.4 3 6
8 11/01/2000 79.5 15.0 3 6
9 12/01/2000 79.5 9.3 3 6
10 13/01/2000 79.5 29.1 3 6

[[2]]
DATE B
7 10/01/2000 7
8 11/01/2000 7
9 12/01/2000 7
10 13/01/2000 7

[[3]]
DATE C G H
3 04/01/2000 325.0 961 3081.9
4 05/01/2000 322.5 945 2524.7
5 06/01/2000 327.5 952 3272.3
6 07/01/2000 327.5 941 2102.9
7 10/01/2000 327.5 946 2901.5
8 11/01/2000 327.5 888 9442.5
9 12/01/2000 331.5 870 7865.8
10 13/01/2000 334.0 853 7742.1
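
A small self-contained sketch of the complete-cases step, on a hypothetical list of two data frames (the cleaned name is introduced here for illustration):

```r
# Hypothetical list of data frames containing some NAs
dfList <- list(
  data.frame(DATE = c("d1", "d2", "d3"), B = c(NA, 7, 7)),
  data.frame(DATE = c("d1", "d2", "d3"), C = c(1, NA, 3), G = c(NA, NA, 6))
)

# Keep only rows with no NA in any column; lapply always returns a list
cleaned <- lapply(dfList, function(x) x[complete.cases(x), , drop = FALSE])

sapply(cleaned, nrow) # rows remaining in each data frame
```

The original row names are preserved, which is why the output above starts at row 7 or row 3 rather than at row 1.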

You can access each of these data frames as follows:

for (i in seq_along(dfList)) print(dfList[[i]])

Split a large dataframe into a list of data frames based on common value in column

You can just as easily access each element in the list using, e.g., path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with a dimension attribute. I would use the list structure returned by split; that is what it was designed for. Each list element can hold data of different types and sizes, so it is very versatile, and you can use *apply functions to operate further on each element in the list. An example follows.

#  For reproducible data
set.seed(1)

# Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )

# Split on userid
out <- split( df , f = df$userid )
#$`1`
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5

#$`2`
# userid data1 data2
#2 2 xfv 4
#4 2 bfe 10
#6 2 mrx 2
#8 2 fqd 9

Access each element using the [[ operator like this:

out[[1]]
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply( out , function(x) mean( x$data2 ) )
# 1 2
#3.75 6.25
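
When the per-group result should always be a single number, vapply is a stricter alternative to sapply, and do.call(rbind, ...) puts the pieces back into one data frame. The sketch below reuses the userid and data2 values from the output above; the means and recombined names are introduced here for illustration.

```r
# Same userid/data2 values as in the example above
df <- data.frame(
  userid = rep(1:2, times = 4),
  data2  = c(3, 4, 1, 10, 6, 2, 5, 9)
)
out <- split(df, df$userid)

# vapply checks that every group yields exactly one numeric value
means <- vapply(out, function(x) mean(x$data2), numeric(1))

# Stack the pieces back into a single data frame (grouped by userid)
recombined <- do.call(rbind, out)
```

The recombined rows come back grouped by userid rather than in their original order, so sort them afterwards if the original order matters.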

