Split data.frame based on levels of a factor into new data.frames
I think that split does exactly what you want.
Notice that X is a list of data frames, as seen by str:
X <- split(df, df$g)
str(X)
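For readers without the original data, here is a minimal, self-contained sketch of the same idea. The data frame below is hypothetical (two numeric columns plus a grouping factor g with five levels); only split itself comes from the answer.

```r
# Hypothetical data: two numeric columns and a grouping factor g with 5 levels
df <- data.frame(
  x = 1:10,
  y = (1:10)^2,
  g = factor(rep(c("A", "B", "C", "D", "E"), each = 2))
)

# split() returns a named list with one data frame per level of g
X <- split(df, df$g)

names(X)    # the level names become the list names
X[["B"]]    # index one group by name
X[[2]]      # or by position
```

Because the list is named after the levels, indexing by name is usually enough and no extra objects are needed.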
If you want individual objects named after the levels of g, you could assign the elements of X from split to objects of those names, though this seems like extra work when you can just index the data frames from the list that split creates.
#I used lapply just to drop the third column g which is no longer needed.
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, 1:2])
#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]
#Or use lapply with assign to assign each piece to an object all at once
lapply(seq_along(Y), function(x) {
assign(c("A", "B", "C", "D", "E")[x], Y[[x]], envir=.GlobalEnv)
}
)
Edit: Or, even better than using lapply to assign to the global environment, use list2env:
names(Y) <- c("A", "B", "C", "D", "E")
list2env(Y, envir = .GlobalEnv)
A
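If you would rather not fill the global environment, list2env can also copy the pieces into an environment of its own. The sketch below assumes a small hypothetical list; the names pieces and group_env are made up for illustration.

```r
# Hypothetical list of data frames, named after the groups
pieces <- split(
  data.frame(x = 1:4, g = c("A", "A", "B", "B")),
  c("A", "A", "B", "B")
)

# Copy each list element into a fresh environment instead of .GlobalEnv
group_env <- list2env(pieces, envir = new.env())

ls(group_env)                 # names of the objects that were created
get("A", envir = group_env)   # retrieve one data frame by name
```

This keeps the convenience of named objects while leaving the workspace uncluttered.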
Split dataframe by levels of a factor and name dataframes by those levels
You can do it with the plyr package:
require(plyr)
dlply(df, .(Z))
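If plyr is not installed, base R's split also returns a list named by the factor levels. This is a sketch on hypothetical data (the column val and the unused level "c" are assumptions added to show the drop argument), not a claim about how dlply works internally.

```r
# Hypothetical data: Z is a factor with one unused level ("c")
df_z <- data.frame(
  val = 1:4,
  Z = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

# One data frame per level, named by level; unused levels give empty pieces
by_level <- split(df_z, df_z$Z)

# drop = TRUE omits levels that do not occur in the data
by_used_level <- split(df_z, df_z$Z, drop = TRUE)

names(by_level)
names(by_used_level)
```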
Splitting data frame into segments for each factor based on a cutoff value in a column in R
In data.table:
dt[, V1 := paste0("A.", 1+cumsum(V4 >= 0.4))]
In dplyr:
df %>%
mutate(V1 = paste0("A.", 1+cumsum(V4 >= 0.4)))
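To see why 1 + cumsum(V4 >= 0.4) produces segment labels, it can help to walk through the steps in base R. The values of V4 below are hypothetical; the names hits and segment_id are introduced only for this illustration.

```r
# Hypothetical values for the cutoff column V4
V4 <- c(0.1, 0.2, 0.5, 0.1, 0.6, 0.3)

# TRUE wherever the cutoff is reached; cumsum() turns that into a running count
hits <- V4 >= 0.4
segment_id <- 1 + cumsum(hits)

# Build the segment labels used in the data.table and dplyr versions
V1 <- paste0("A.", segment_id)
data.frame(V4, hits, V1)
```

Note that each row that reaches the cutoff starts a new segment, so it is grouped with the rows that follow it rather than the rows before it.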
Split a data.frame into smaller data.frames, based on the start and end indices (held in two vectors) and using a condition
Based on the answer to the initial comment regarding row indices, and using a 3-part approach similar to @Roland's, the following should be what you want.
This creates a generic function to return all rows from "start" to "end" (assuming the provided elements are integers)
split_data <- function( start, end, dfr ){
dfr[start:end,]
}
This creates a list of ALL available splits.
# SIMPLIFY=FALSE keeps the result as a list of data frames instead of a matrix
split.frames <- mapply(split_data,START,END,MoreArgs=list(dfr=ALL_DATA),SIMPLIFY=FALSE)
This returns a logical vector with the i-th element equal to TRUE if the i-th split meets the desired condition.
cond <- sapply( split.frames, function(x){sum(x$Value)>=2} )
This returns only the splits that meet the condition.
split.frames <- split.frames[cond]
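The three steps can be tried end to end on toy data. Everything below is a sketch under assumptions: the contents of ALL_DATA, START, and END are invented, and Map is swapped in for mapply because it always returns a list.

```r
# Hypothetical inputs: a data frame plus start and end row indices
ALL_DATA <- data.frame(
  ID = 1:9,
  Value = c(0, 1, 0, 0, 1, 1, 0, 0, 0)
)
START <- c(1, 4, 7)
END <- c(3, 6, 9)

# Map() always returns a list, so each element stays a data frame
pieces <- Map(function(s, e) ALL_DATA[s:e, ], START, END)

# Keep only the pieces whose Value column sums to at least 2
keep <- vapply(pieces, function(p) sum(p$Value) >= 2, logical(1))
kept <- pieces[keep]

length(kept)
kept[[1]]
```

Here only the middle piece (rows 4 to 6) has a Value sum of 2, so it is the only one retained.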
EDIT #1
Per the comment about saving off the splits, it is probably best to use the str_pad() function from the R package stringr for creating the file names, but here is a base R implementation that should work for you.
nchars <- nchar( length(split.frames) )
print.expr <- paste0("%0",nchars,"d")
for( i in seq_along(split.frames) ){
file.i <- paste0( sprintf(print.expr,i), ".dat" )
write.table( split.frames[[i]], file=file.i, sep="\t", row.names=FALSE )
}
Not sure if you want column and/or row names in your saved outputs, but I assumed they were YES and NO respectively.
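The zero-padding logic can be checked on its own without writing any files. This is a small sketch assuming 12 splits; n_files, width, fmt, and file_names are names made up for the example.

```r
# Hypothetical number of splits to save
n_files <- 12

# Width of the largest index, e.g. 2 characters for 12
width <- nchar(n_files)

# Build a format such as "%02d.dat" and apply it to every index
fmt <- paste0("%0", width, "d.dat")
file_names <- sprintf(fmt, seq_len(n_files))

head(file_names, 3)
```

Padding to a common width means the files sort alphabetically in the same order as their numeric index.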
Split a data frame based on some criteria
If you want the rows whose values are in vec to come first, you can create two data frames: one with the rows whose values are in vec and one with the rows whose values are not.
Then, you can concatenate them with rbind:
in.vec = DF$col2 %in% vec
new.DF = rbind(DF[in.vec,], DF[!in.vec,])
where DF[in.vec,] selects the rows for which the values of col2 can be found in vec.
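A quick way to check the reordering is to run it on a tiny data frame. The contents of DF and vec below are hypothetical; only the column name col2 and the rbind pattern come from the answer.

```r
# Hypothetical data and lookup vector
DF <- data.frame(col1 = 1:5, col2 = c("a", "b", "c", "d", "e"))
vec <- c("b", "d")

# Logical index: TRUE for rows whose col2 value appears in vec
in.vec <- DF$col2 %in% vec

# Matching rows first, then all remaining rows (note the comma in both subsets)
new.DF <- rbind(DF[in.vec, ], DF[!in.vec, ])
new.DF
```

The trailing comma matters: without it, a logical index on a data frame selects columns rather than rows.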
Subsetting a data.frame based on factor levels in a second data.frame
df.1[,unique(df.2$Var[which(df.2$Info=="X1")])]
A C
1 0.8924861 0.7149490854
2 0.5711894 0.7200819517
3 0.7049629 0.0004052017
4 0.9188677 0.5007302717
5 0.3440664 0.9138259818
6 0.8657903 0.2724015017
7 0.7631228 0.5686033906
8 0.8388003 0.7377064163
9 0.0796059 0.6196693045
10 0.5029824 0.8717568610
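The one-liner above can be unpacked into steps. The sketch below uses hypothetical stand-ins for df.1 and df.2 (their real contents are not shown in the answer) and converts Var with as.character, an added precaution in case Var is stored as a factor.

```r
# Hypothetical stand-ins for the two data frames
df.1 <- data.frame(A = 1:3, B = 4:6, C = 7:9)
df.2 <- data.frame(Var = c("A", "B", "C"), Info = c("X1", "X2", "X1"))

# Column names of df.1 whose Info entry in df.2 is "X1"
wanted <- unique(as.character(df.2$Var[df.2$Info == "X1"]))

# drop = FALSE keeps a data frame even if only one column matches
subset_df <- df.1[, wanted, drop = FALSE]
names(subset_df)
```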
Split/subset a data frame by factors in one column
We could use split:
mylist <- split(df, df$State)
mylist
$AL
ID Rate State
1 1 24 AL
4 4 34 AL
$FL
ID Rate State
3 3 46 FL
6 6 99 FL
$MN
ID Rate State
2 2 35 MN
5 5 78 MN
To access elements by number:
mylist[[1]]
or by name:
mylist$AL
ID Rate State
1 1 24 AL
4 4 34 AL
?split
Description
split divides the data in the vector x into the groups defined by f.
The replacement forms replace values corresponding to such a division.
unsplit reverses the effect of split.
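The statement that unsplit reverses split can be tried directly. The data frame below is an assumption reconstructed from the printed output above (the original df is not shown); restored is a name introduced for the example.

```r
# Data reconstructed from the printed output above (an assumption)
df <- data.frame(
  ID = 1:6,
  Rate = c(24, 35, 46, 34, 78, 99),
  State = c("AL", "MN", "FL", "AL", "MN", "FL")
)

# Split by State, then put the pieces back in the original row order
mylist <- split(df, df$State)
restored <- unsplit(mylist, df$State)

restored
```

This round trip is handy when each piece is modified separately and the results then need to be reassembled in the original order.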
Splitting a data frame based on character string
The main idea is to create a factor used to define the grouping for splitting. One way is to extract the digit pattern from the provided variable Barcode using a regular expression. Then we convert the obtained character vector of digits to a factor with as.factor().
We can, of course, use other regular expression techniques to get the job done, or more user-friendly wrapper functions from the stringr package, as in the second example (the tidyverse-ish approach).
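One detail worth checking is the pattern itself: "[[:digit:]]" matches a single digit, which is enough for the barcodes in this question. If suffixes could run to two or more digits, a pattern such as "[[:digit:]]+$" may be safer. The sketch below uses hypothetical barcodes to illustrate that variation; it is not part of the original answer.

```r
# Hypothetical barcodes, including two-digit suffixes
Barcode <- c("ABCD-1", "ABCC-1", "ABCD-10", "ABCC-10")

# "+" matches one or more digits, "$" anchors the match at the end of the string
suffix <- regmatches(Barcode, regexpr("[[:digit:]]+$", Barcode))
suffix

# Split on the extracted suffix, converted to a factor
groups <- split(data.frame(Barcode), as.factor(suffix))
names(groups)
```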
Example 1
A base R solution using split:
# The provided data
Barcode <- c("ABCD-1", "ABCC-1", "ABCD-2", "ABCC-2", "ABCD-3", "ABCC-3",
"ABCD-4", "ABCC-4", "ABCD-5", "ABCC-5","ABCD-6", "ABCC-6")
bar_f <- data.frame(Barcode)
factor_for_split <- regmatches(x = bar_f$Barcode,
m = regexpr(pattern = "[[:digit:]]",
text = bar_f$Barcode))
factor_for_split
#> [1] "1" "1" "2" "2" "3" "3" "4" "4" "5" "5" "6" "6"
# Create a list of 6 data frames as asked
lst <- split(x = bar_f, f = as.factor(factor_for_split))
lst
#> $`1`
#> Barcode
#> 1 ABCD-1
#> 2 ABCC-1
#>
#> $`2`
#> Barcode
#> 3 ABCD-2
#> 4 ABCC-2
#>
#> $`3`
#> Barcode
#> 5 ABCD-3
#> 6 ABCC-3
#>
#> $`4`
#> Barcode
#> 7 ABCD-4
#> 8 ABCC-4
#>
#> $`5`
#> Barcode
#> 9 ABCD-5
#> 10 ABCC-5
#>
#> $`6`
#> Barcode
#> 11 ABCD-6
#> 12 ABCC-6
# Edit names of the list
names(lst) <- paste0("df_", names(lst))
# Assign each data frame from the list to a data frame object in the global
# environment
for(name in names(lst)) {
assign(name, lst[[name]])
}
Created on 2020-02-24 by the reprex package (v0.3.0)
Example 2
And, if you prefer, here is a tidyverse-ish approach:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
Barcode <- c("ABCD-1", "ABCC-1", "ABCD-2", "ABCC-2", "ABCD-3", "ABCC-3",
"ABCD-4", "ABCC-4", "ABCD-5", "ABCC-5","ABCD-6", "ABCC-6")
bar_f <- data.frame(Barcode)
# Store the result so the list of tibbles can be reused afterwards
lst <- bar_f %>%
  mutate(factor_for_split = str_extract(string = Barcode,
                                        pattern = "[[:digit:]]")) %>%
  group_split(factor_for_split)
lst
#> [[1]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-1 1
#> 2 ABCC-1 1
#>
#> [[2]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-2 2
#> 2 ABCC-2 2
#>
#> [[3]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-3 3
#> 2 ABCC-3 3
#>
#> [[4]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-4 4
#> 2 ABCC-4 4
#>
#> [[5]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-5 5
#> 2 ABCC-5 5
#>
#> [[6]]
#> # A tibble: 2 x 2
#> Barcode factor_for_split
#> <fct> <chr>
#> 1 ABCD-6 6
#> 2 ABCC-6 6
#>
#> attr(,"ptype")
#> # A tibble: 0 x 2
#> # ... with 2 variables: Barcode <fct>, factor_for_split <chr>
# lst is assumed to hold the list returned by group_split() above
names(lst) <- paste0("df_", seq_along(lst))
for(name in names(lst)) {
  assign(name, lst[[name]])
}