R subsetting a data frame into multiple data frames based on multiple column values
You are looking for split:
split(df, with(df, interaction(v1,v2)), drop = TRUE)
$E.X
v1 v2 v3 v4 v5
3 E X 2 12 15
5 E X 2 14 16
$D.Y
v1 v2 v3 v4 v5
2 D Y 10 12 8
$A.Z
v1 v2 v3 v4 v5
1 A Z 1 10 12
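For a self-contained check of the call above, here is a minimal sketch. The data frame is an assumption: it is rebuilt from the four rows printed in the output, since the original df is not shown.

```r
# Hypothetical data, reconstructed from the rows printed above
df <- data.frame(
  v1 = c("A", "D", "E", "E"),
  v2 = c("Z", "Y", "X", "X"),
  v3 = c(1, 10, 2, 2),
  v4 = c(10, 12, 12, 14),
  v5 = c(12, 8, 15, 16)
)

# One list element per observed (v1, v2) combination;
# drop = TRUE discards combinations that never occur
pieces <- split(df, with(df, interaction(v1, v2)), drop = TRUE)

print(names(pieces))
print(sapply(pieces, nrow))
```

Without drop = TRUE the result would also contain an empty data frame for every v1/v2 combination that does not appear in the data.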
As noted in the comments, any of the following would work:
library(microbenchmark)
microbenchmark(
split(df, list(df$v1,df$v2), drop = TRUE),
split(df, interaction(df$v1,df$v2), drop = TRUE),
split(df, with(df, interaction(v1,v2)), drop = TRUE))
Unit: microseconds
expr                                                        min        lq    median       uq      max neval
split(df, list(df$v1, df$v2), drop = TRUE)             1119.845 1129.3750 1145.8815 1182.119 3910.249   100
split(df, interaction(df$v1, df$v2), drop = TRUE)       893.749  900.5720  909.8035  936.414 3617.038   100
split(df, with(df, interaction(v1, v2)), drop = TRUE)   895.150  902.5705  909.8505  927.128 1399.284   100
It appears that interaction is slightly faster, probably because f = list(...) is simply converted to an interaction inside split.
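One way to see that the list form and the interaction form describe the same grouping is to compare their results directly. This is only a sketch on a small hypothetical df, not the data from the question.

```r
# Hypothetical data with the same column names as in the answer
df <- data.frame(
  v1 = c("A", "D", "E", "E"),
  v2 = c("Z", "Y", "X", "X"),
  v4 = c(10, 12, 12, 14)
)

# Grouping given as a list of vectors
by_list <- split(df, list(df$v1, df$v2), drop = TRUE)

# Grouping given as an explicit interaction factor
by_interaction <- split(df, interaction(df$v1, df$v2), drop = TRUE)

# Compare group names and contents
same_groups <- isTRUE(all.equal(by_list, by_interaction))
print(same_groups)
```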
Edit
If you just want to use the subset data.frames, I would suggest data.table for ease of coding:
library(data.table)
dt <- data.table(df)
dt[, plot(v4, v5), by = list(v1, v2)]
Subset dataframe into multiple based on multiple conditions
Remove the numbers from the MS column and use the result in split to split one data frame into a list of data frames based on the pattern.
result <- split(D_MtC, sub('\\d+', '', D_MtC$MS))
where the output from sub is:
sub('\\d+', '', D_MtC$MS)
#[1] "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl" "bl"
# "bl" "bu" "bu" "bu" "bu" "bu" "bu" "bu" "bu"
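As a runnable illustration of the same idea, here is a minimal sketch. The contents of D_MtC are hypothetical; only the shape of the MS labels (letters followed by digits) is taken from the output above.

```r
# Hypothetical stand-in for D_MtC; only the MS column matters here
D_MtC <- data.frame(
  MS = c("bl1", "bl2", "bl10", "bu1", "bu2"),
  value = c(5, 3, 8, 2, 7)
)

# Strip the first run of digits to get the grouping key
key <- sub("\\d+", "", D_MtC$MS)

# One data frame per distinct key
result <- split(D_MtC, key)

print(names(result))
print(sapply(result, nrow))
```

Note that sub removes only the first run of digits in each label. If a label can contain digits in more than one place, gsub removes all of them.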
How to subset multiple data frames by a variable?
We can apply subset to each data frame in the list with lapply:
imps1 <- lapply(imps, subset, subset = gender == 1)
imps0 <- lapply(imps, subset, subset = gender == 0)
Or using the tidyverse:
library(dplyr)
library(purrr)
imps1 <- map(imps, ~ .x %>%
filter(gender == 1))
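If you prefer to avoid the non-standard evaluation that subset and filter rely on, the same filtering can be sketched with plain indexing. The list imps and the helper keep_gender below are hypothetical; the gender column is assumed to be coded 0/1 as in the answers above.

```r
# Hypothetical list of data frames standing in for imps
imps <- list(
  data.frame(id = 1:4, gender = c(1, 0, 1, 0)),
  data.frame(id = 1:4, gender = c(0, 0, 1, 1))
)

# Hypothetical helper: keep the rows whose gender equals `value`.
# which() drops rows where gender is NA, as subset() would.
keep_gender <- function(data, value) {
  data[which(data$gender == value), , drop = FALSE]
}

imps1 <- lapply(imps, keep_gender, value = 1)
imps0 <- lapply(imps, keep_gender, value = 0)

print(imps1[[1]])
print(imps0[[2]])
```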
How to subset/split a dataframe of multiple columns by common number of values available in R
This should work for what you are doing; it produces a list of data frames that you can index into one at a time:
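# Count the non-NA values in each column except the first (DATE)
counts <- sapply(df[, 2:ncol(df)], function(x) sum(!is.na(x)))
# For each distinct count, find the positions of the columns that share it
x <- lapply(unique(counts), function(n) which(counts == n))
# Build one data frame per group, always keeping the first column
dfList <- lapply(x, function(idx) df[, c(1, idx + 1)])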
Output is as follows:
dfList
[[1]]
DATE A D E F
1 31/12/1999 79.5 36.7 3 6
2 03/01/2000 79.5 36.7 3 6
3 04/01/2000 79.5 36.7 3 6
4 05/01/2000 79.5 38.8 3 6
5 06/01/2000 79.5 20.3 3 6
6 07/01/2000 79.5 15.6 3 6
7 10/01/2000 79.5 5.4 3 6
8 11/01/2000 79.5 15.0 3 6
9 12/01/2000 79.5 9.3 3 6
10 13/01/2000 79.5 29.1 3 6
[[2]]
DATE B
1 31/12/1999 NA
2 03/01/2000 NA
3 04/01/2000 NA
4 05/01/2000 NA
5 06/01/2000 NA
6 07/01/2000 NA
7 10/01/2000 7
8 11/01/2000 7
9 12/01/2000 7
10 13/01/2000 7
[[3]]
DATE C G H
1 31/12/1999 NA NA NA
2 03/01/2000 NA NA NA
3 04/01/2000 325.0 961 3081.9
4 05/01/2000 322.5 945 2524.7
5 06/01/2000 327.5 952 3272.3
6 07/01/2000 327.5 941 2102.9
7 10/01/2000 327.5 946 2901.5
8 11/01/2000 327.5 888 9442.5
9 12/01/2000 331.5 870 7865.8
10 13/01/2000 334.0 853 7742.1
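The same grouping can also be sketched with split itself, by splitting the column names on their non-NA counts. The small df below is hypothetical and much shorter than the data shown above. Unlike the code above, this variant names the list elements by count and orders them by increasing count.

```r
# Hypothetical data: DATE plus columns with differing numbers of NAs
df <- data.frame(
  DATE = c("31/12/1999", "03/01/2000", "04/01/2000", "05/01/2000"),
  A = c(1, 2, 3, 4),
  B = c(NA, NA, 7, 7),
  C = c(NA, 5, 6, 7),
  D = c(4, 3, 2, 1)
)

# Non-NA count per column, excluding DATE
counts <- colSums(!is.na(df[-1]))

# Group the column names by their count
groups <- split(names(counts), counts)

# One data frame per group, always keeping DATE
dfList <- lapply(groups, function(cols) df[, c("DATE", cols), drop = FALSE])

print(lapply(dfList, names))
```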
To retrieve only complete cases from each of the data frames in the data frame list above, you can do:
dfList <- lapply(dfList, function(x) x[complete.cases(x), ])
Resulting output will be the following list of the three data frames in this example:
[[1]]
DATE A D E F
1 31/12/1999 79.5 36.7 3 6
2 03/01/2000 79.5 36.7 3 6
3 04/01/2000 79.5 36.7 3 6
4 05/01/2000 79.5 38.8 3 6
5 06/01/2000 79.5 20.3 3 6
6 07/01/2000 79.5 15.6 3 6
7 10/01/2000 79.5 5.4 3 6
8 11/01/2000 79.5 15.0 3 6
9 12/01/2000 79.5 9.3 3 6
10 13/01/2000 79.5 29.1 3 6
[[2]]
DATE B
7 10/01/2000 7
8 11/01/2000 7
9 12/01/2000 7
10 13/01/2000 7
[[3]]
DATE C G H
3 04/01/2000 325.0 961 3081.9
4 05/01/2000 322.5 945 2524.7
5 06/01/2000 327.5 952 3272.3
6 07/01/2000 327.5 941 2102.9
7 10/01/2000 327.5 946 2901.5
8 11/01/2000 327.5 888 9442.5
9 12/01/2000 331.5 870 7865.8
10 13/01/2000 334.0 853 7742.1
You can access each of these data frames as follows:
for (i in seq_along(dfList)) {print(dfList[[i]])}
Split a large dataframe into a list of data frames based on common value in column
You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use *apply functions to further operate on each element in the list. Example below.
# For reproducible data
set.seed(1)
# Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )
# Split on userid
out <- split( df , f = df$userid )
#$`1`
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
#$`2`
# userid data1 data2
#2 2 xfv 4
#4 2 bfe 10
#6 2 mrx 2
#8 2 fqd 9
Access each element using the [[ operator like this:
out[[1]]
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:
sapply( out , function(x) mean( x$data2 ) )
# 1 2
#3.75 6.25
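The earlier point, that a matrix is an atomic vector with a dim attribute while a list can hold elements of different types and sizes, can be checked directly. The object names below are illustrative only.

```r
# A matrix is an atomic vector plus a dim attribute
m <- matrix(1:6, nrow = 2)
print(is.atomic(m))
print(attributes(m))

# A list can hold a matrix, a character vector and a data frame side by side
parts <- list(m, letters[1:3], data.frame(x = 1:2))
sizes <- lengths(parts)
print(sizes)

# [[ returns the stored object itself, with its structure intact
print(dim(parts[[1]]))
```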