# Split Data.Frame Based on Levels of a Factor into New Data.Frames

## Split data.frame based on levels of a factor into new data.frames

I think that `split` does exactly what you want.

Notice that X is a list of data frames, as seen by `str`:

``
X <- split(df, df$g)
str(X)
``

If you want individual objects named after the levels of `g`, you could assign the elements of `X` to objects with those names, though this seems like extra work when you can simply index the data frames in the list that `split` creates.

``
# lapply is used here only to drop the third column, g, which is no longer needed.
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, 1:2])

# Assign the data frames in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]

# Or use lapply with assign to assign each piece to an object all at once
lapply(seq_along(Y), function(x) {
    assign(c("A", "B", "C", "D", "E")[x], Y[[x]], envir = .GlobalEnv)
})
``

Edit: Even better than using `lapply` to assign to the global environment, use `list2env`:

``
names(Y) <- c("A", "B", "C", "D", "E")
list2env(Y, envir = .GlobalEnv)
A
``
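The steps above can be sketched end to end as follows. This is only a minimal illustration: the toy data frame `df` and the environment `pieces` are made up, and the pieces are copied into a dedicated environment rather than `.GlobalEnv` so that nothing in the workspace gets overwritten.

```r
# Toy data (hypothetical): two value columns and a grouping factor g
df <- data.frame(
  x = 1:6,
  y = c(2.5, 3.1, 4.7, 5.0, 6.2, 7.9),
  g = factor(c("A", "B", "C", "A", "B", "C"))
)

# One data frame per level of g; the list is named after the levels
X <- split(df, df$g)

# Drop the grouping column from every piece; lapply keeps the names
Y <- lapply(X, function(piece) piece[, c("x", "y")])

# Copy the pieces into a separate environment instead of .GlobalEnv
pieces <- new.env()
list2env(Y, envir = pieces)

ls(pieces)   # names of the objects that were created
pieces$A     # the rows of df where g == "A"
```

Because `lapply` is applied to `X` directly here, the names come from the factor levels and do not have to be typed by hand.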

## Split dataframe by levels of a factor and name dataframes by those levels

You can do it with the `plyr` package

``
require(plyr)
dlply(df, .(Z))
``
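For comparison, here is a base R sketch that does not need `plyr`. It assumes a made-up data frame `df` with a factor column `Z`; `split` also names each piece after its factor level.

```r
# Toy data (hypothetical): a value column and a factor Z
df <- data.frame(
  value = c(10, 20, 30, 40),
  Z = factor(c("low", "high", "low", "high"))
)

# split() returns a list whose names are the levels of Z
by_level <- split(df, df$Z)

names(by_level)      # the level names, in level order
by_level[["low"]]    # the rows where Z == "low"
```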

## Splitting data frame into segments for each factor based on a cutoff value in a column in R

In `data.table`:

``dt[, V1 := paste0("A.", 1+cumsum(V4 >= 0.4))]``

In `dplyr`:

``
df %>%
  mutate(V1 = paste0("A.", 1 + cumsum(V4 >= 0.4)))
``
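To see why `1 + cumsum(V4 >= 0.4)` produces segment labels, it may help to trace it on a small vector. The values of `V4` below are invented for illustration; every value at or above the cutoff starts a new segment.

```r
# Hypothetical values for the cutoff column
V4 <- c(0.1, 0.5, 0.2, 0.3, 0.6, 0.1)

at_cutoff  <- V4 >= 0.4               # FALSE TRUE FALSE FALSE TRUE FALSE
segment_id <- 1 + cumsum(at_cutoff)   # 1 2 2 2 3 3
V1 <- paste0("A.", segment_id)        # "A.1" "A.2" "A.2" "A.2" "A.3" "A.3"

# The label column can then drive an ordinary split()
segments <- split(data.frame(V1, V4), V1)
names(segments)
```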

## Split a data.frame into smaller data.frames, based on the start and end indices (held in two vectors) and using a condition

Based on the answer to the initial comment regarding row indices, and using a 3-part approach similar to @Roland's, the following should be what you want.

This creates a generic function that returns all rows from `start` to `end` (assuming the provided elements are integers).

``
split_data <- function(start, end, dfr) {
  dfr[start:end, ]
}
``

This creates a list of ALL available splits.

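``
# SIMPLIFY = FALSE keeps the result a list of data frames;
# by default mapply may collapse equally shaped results into a matrix.
split.frames <- mapply(split_data, START, END,
                       MoreArgs = list(dfr = ALL_DATA), SIMPLIFY = FALSE)
``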

This returns a logical vector with the `i`th element equal to TRUE if the `i`th split meets the desired condition.

``
cond <- sapply(split.frames, function(x) { sum(x$Value) >= 2 })
``

This returns only the splits that meet the condition.

``split.frames <- split.frames[cond]``
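The three steps can be tried on toy input as follows. This is a sketch under stated assumptions: the contents of `ALL_DATA`, `START` and `END` are invented, and `Map` and `vapply` are swapped in for `mapply` and `sapply` because they always return a list and a logical vector respectively.

```r
# Hypothetical data: 9 rows with a numeric Value column
ALL_DATA <- data.frame(id = 1:9, Value = c(1, 0, 0, 1, 1, 0, 0, 0, 3))
START <- c(1, 4, 7)   # first row of each split
END   <- c(3, 6, 9)   # last row of each split

# Part 1 and 2: build every split; Map() never simplifies its result
split.frames <- Map(function(start, end) ALL_DATA[start:end, ], START, END)

# Part 3: keep only the splits whose Value column sums to at least 2
cond <- vapply(split.frames, function(x) sum(x$Value) >= 2, logical(1))
kept <- split.frames[cond]

length(kept)   # number of splits that met the condition
```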

EDIT #1

Per the comment about saving off the splits, it is probably best to use the `str_pad()` function from the R package `stringr` for creating the file names, but here is a base R implementation that should work for you.

``
nchars <- nchar(length(split.frames))
print.expr <- paste0("%0", nchars, "d")
for (i in seq_along(split.frames)) {
  file.i <- paste0(sprintf(print.expr, i), ".dat")
  write.table(split.frames[[i]], file = file.i, sep = "\t", row.names = FALSE)
}
``
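The zero-padding logic can be checked on its own, without writing any files. In this sketch the number of splits, `n_splits`, is a made-up stand-in for `length(split.frames)`.

```r
# Hypothetical: pretend there are 12 splits, so two digits are needed
n_splits <- 12

nchars     <- nchar(n_splits)               # digits in the largest index
print.expr <- paste0("%0", nchars, "d")     # sprintf format, here "%02d"
file.names <- paste0(sprintf(print.expr, seq_len(n_splits)), ".dat")

head(file.names, 3)
```

Padding every index to the same width keeps the files in numeric order when they are sorted by name.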

I was not sure whether you want column and/or row names in the saved output, so I assumed yes for column names and no for row names.

## Split a data frame based on some criteria

If you want the rows whose values are in `vec` to come first, you can create two data frames: one with the rows whose values are in `vec` and one with the rows whose values are not.

Then, you can concatenate them with `rbind`:

``
in.vec = DF$col2 %in% vec
new.DF = rbind(DF[in.vec, ], DF[!in.vec, ])
``

where `DF[in.vec, ]` selects the rows for which the value of `col2` can be found in `vec`.
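A small worked example, with made-up contents for `DF` and `vec`, shows the resulting row order. The `order()` variant at the end is an alternative added here, not part of the original answer; it should give the same order because `order` keeps ties in their original sequence.

```r
# Hypothetical data
DF  <- data.frame(col1 = 1:5, col2 = c("a", "b", "c", "d", "e"))
vec <- c("b", "d")

in.vec <- DF$col2 %in% vec

# Rows found in vec first, all remaining rows after them
new.DF <- rbind(DF[in.vec, ], DF[!in.vec, ])
as.character(new.DF$col2)

# Alternative sketch: sort on the negated flag (FALSE sorts before TRUE)
alt.DF <- DF[order(!in.vec), ]
```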

## Subsetting a data.frame based on factor levels in a second data.frame

`df.1[, unique(df.2$Var[which(df.2$Info == "X1")])]`

``
           A            C
1  0.8924861 0.7149490854
2  0.5711894 0.7200819517
3  0.7049629 0.0004052017
4  0.9188677 0.5007302717
5  0.3440664 0.9138259818
6  0.8657903 0.2724015017
7  0.7631228 0.5686033906
8  0.8388003 0.7377064163
9  0.0796059 0.6196693045
10 0.5029824 0.8717568610
``
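A self-contained sketch of the same lookup follows. The contents of `df.1` and `df.2` are invented; the `as.character()` call is a defensive addition, since indexing by a factor uses its integer codes rather than its labels.

```r
# Hypothetical data: df.2 says which columns of df.1 belong to which Info group
df.1 <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6))
df.2 <- data.frame(Var = c("A", "B", "C"), Info = c("X1", "X2", "X1"))

# Column names tagged "X1", as plain character strings
wanted <- unique(as.character(df.2$Var[df.2$Info == "X1"]))

# drop = FALSE keeps a data frame even if only one column matches
subset.1 <- df.1[, wanted, drop = FALSE]
names(subset.1)
```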

## Split/subset a data frame by factors in one column

We could use `split`:

``
mylist <- split(df, df$State)
mylist
$AL
  ID Rate State
1  1   24    AL
4  4   34    AL

$FL
  ID Rate State
3  3   46    FL
6  6   99    FL

$MN
  ID Rate State
2  2   35    MN
5  5   78    MN
``

To access an element by number:

``mylist[[1]]``

or by name:

``
mylist$AL
  ID Rate State
1  1   24    AL
4  4   34    AL
``

`?split`

Description

split divides the data in the vector x into the groups defined by f.
The replacement forms replace values corresponding to such a division.
unsplit reverses the effect of split.
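As a brief sketch of the last sentence of that description, `unsplit` can put the pieces back in their original row order. The data frame below is rebuilt from the output printed earlier in this section; the name `restored` is made up.

```r
# Data reconstructed from the printed output above
df <- data.frame(
  ID    = 1:6,
  Rate  = c(24, 35, 46, 34, 78, 99),
  State = c("AL", "MN", "FL", "AL", "MN", "FL")
)

mylist <- split(df, df$State)

# unsplit() needs the same grouping vector that was used for split()
restored <- unsplit(mylist, df$State)
restored
```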

## Splitting a data frame based on character string

The main idea is to create a factor that defines the grouping for splitting. One way is to extract the digit pattern from the provided variable `Barcode` using a regular expression. Then we convert the obtained character vector of digits to a factor with `as.factor()`.
We can, of course, use other regular expression techniques to get the job done, or more user-friendly wrapper functions from the `stringr` package, as in the second example (the `tidyverse`-ish approach).
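As one such alternative, here is a hedged sketch that takes everything after the last hyphen with `sub`. The pattern `[[:digit:]]` matches a single digit, so a suffix with two digits would be cut short; the barcodes ending in `-10` below are made up to show that case.

```r
# Hypothetical barcodes, including two-digit suffixes
Barcode <- c("ABCD-1", "ABCC-1", "ABCD-10", "ABCC-10")

# Remove everything up to and including the last hyphen
suffix <- sub("^.*-", "", Barcode)

# Use the suffix as the grouping factor
groups <- split(data.frame(Barcode), factor(suffix))
names(groups)
```

This variant assumes every barcode contains a hyphen; a value without one would be returned unchanged by `sub`.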

### Example 1

A base R solution using `split`:

``
# The provided data
Barcode <- c("ABCD-1", "ABCC-1", "ABCD-2", "ABCC-2", "ABCD-3", "ABCC-3",
             "ABCD-4", "ABCC-4", "ABCD-5", "ABCC-5", "ABCD-6", "ABCC-6")
bar_f <- data.frame(Barcode)

factor_for_split <- regmatches(x = bar_f$Barcode,
                               m = regexpr(pattern = "[[:digit:]]",
                                           text = bar_f$Barcode))
factor_for_split
#>  [1] "1" "1" "2" "2" "3" "3" "4" "4" "5" "5" "6" "6"

# Create a list of 6 data frames as asked
lst <- split(x = bar_f, f = as.factor(factor_for_split))
lst
#> $`1`
#>   Barcode
#> 1  ABCD-1
#> 2  ABCC-1
#>
#> $`2`
#>   Barcode
#> 3  ABCD-2
#> 4  ABCC-2
#>
#> $`3`
#>   Barcode
#> 5  ABCD-3
#> 6  ABCC-3
#>
#> $`4`
#>   Barcode
#> 7  ABCD-4
#> 8  ABCC-4
#>
#> $`5`
#>    Barcode
#> 9   ABCD-5
#> 10  ABCC-5
#>
#> $`6`
#>    Barcode
#> 11  ABCD-6
#> 12  ABCC-6

# Edit names of the list
names(lst) <- paste0("df_", names(lst))

# Assign each data frame from the list to a data frame object in the global
# environment
for (name in names(lst)) {
  assign(name, lst[[name]])
}
``

Created on 2020-02-24 by the reprex package (v0.3.0)

### Example 2

And, if you prefer, here is a `tidyverse`-ish approach:

``
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(stringr)

Barcode <- c("ABCD-1", "ABCC-1", "ABCD-2", "ABCC-2", "ABCD-3", "ABCC-3",
             "ABCD-4", "ABCC-4", "ABCD-5", "ABCC-5", "ABCD-6", "ABCC-6")
bar_f <- data.frame(Barcode)

# Store the result so that it can be named and assigned below
lst <- bar_f %>%
  mutate(factor_for_split = str_extract(string = Barcode,
                                        pattern = "[[:digit:]]")) %>%
  group_split(factor_for_split)
lst
#> [[1]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-1  1
#> 2 ABCC-1  1
#>
#> [[2]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-2  2
#> 2 ABCC-2  2
#>
#> [[3]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-3  3
#> 2 ABCC-3  3
#>
#> [[4]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-4  4
#> 2 ABCC-4  4
#>
#> [[5]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-5  5
#> 2 ABCC-5  5
#>
#> [[6]]
#> # A tibble: 2 x 2
#>   Barcode factor_for_split
#>   <fct>   <chr>
#> 1 ABCD-6  6
#> 2 ABCC-6  6
#>
#> attr(,"ptype")
#> # A tibble: 0 x 2
#> # ... with 2 variables: Barcode <fct>, factor_for_split <chr>

names(lst) <- paste0("df_", 1:length(lst))

for (name in names(lst)) {
  assign(name, lst[[name]])
}
``

Created on 2020-02-24 by the reprex package (v0.3.0)