What Is the Algorithm Behind R Core's 'Split' Function

What is the algorithm behind R core's `split` function?

How does `split.data.frame` work?

function (x, f, drop = FALSE, ...) 
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), 
       function(ind) x[ind, , drop = FALSE])

It calls split.default to split the row index vector seq_len(nrow(x)), then uses an lapply loop to extract the associated rows into each list entry.

This isn't strictly a "data.frame" method. It splits any 2-dimensional object by its 1st dimension, so it can also split a matrix by rows.
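A small sketch of that last point, assuming a plain integer matrix A and a made-up grouping vector g (both names are only for illustration): calling split.data.frame on a matrix should return row blocks that are still matrices.

```r
A <- matrix(1:9, 3)
g <- c(1, 1, 2)                  # hypothetical grouping of the 3 rows

rows <- split.data.frame(A, g)   # split by the 1st dimension
rows$`1`                         # rows 1 and 2 of A, still a matrix
rows$`2`                         # row 3 of A, kept as a 1 x 3 matrix
```

Because the extraction uses x[ind, , drop = FALSE], a group with a single row is not collapsed to a vector.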

How does `split.default` work?

function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
{
if (!missing(...)) 
    .NotYetUsed(deparse(...), error = FALSE)
if (is.list(f)) 
    f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f)) 
    f <- as.factor(f)
else if (drop) 
    f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class"))) 
    return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}

If x has no class attribute (i.e., it is mostly a plain vector), .Internal(split(x, f)) is used directly;
otherwise, split.default uses .Internal(split()) to split the index along x, then uses a for loop to extract the associated elements into each list entry.
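To make the second branch concrete, here is a hedged sketch that emulates it by hand for a factor. The names x, g, ind and by_hand are illustrative only, and the emulation uses the public split() on positions rather than the .Internal call.

```r
x <- factor(c("u", "v", "w", "u"))   # an object with a class attribute
g <- c("a", "b", "a", "b")           # hypothetical grouping

# classed branch, emulated: split the positions, then subset with `[`
ind <- split(seq_along(x), g)
by_hand <- lapply(ind, function(i) x[i])

# compare with what split.default itself returns
identical(by_hand, split.default(x, g))
```

Since the pieces are produced by "[", each one keeps the class (and here the levels) of x.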

A plain vector (see ?vector) has no class attribute and has one of the following modes:

"logical", "integer", "numeric", "complex", "character" and "raw" (the atomic modes)
"list"
"expression"

An object with a class... Er... there are so many!! Let me just give three examples:

"factor"
"data.frame"
"matrix" (an implicit class only: attr(A, "class") is NULL for a plain matrix A, so it actually takes the .Internal branch)

In my opinion split.default is not well written. There are so many objects with classes, yet split.default deals with all of them in the same way, via "[". This works fine with a "factor" and a "data.frame" (so we will be splitting a data frame along its columns!), but a matrix is simply treated as a plain vector, which is definitely not what we expect.

A <- matrix(1:9, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9

split.default(A, c(1, 1, 2))  ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9

Actually, the recycling rule has been applied to c(1, 1, 2), and we are equivalently doing:

split(c(A), rep_len(c(1,1,2), length(A)))
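A short check of that equivalence, plus the contrasting data frame case (a sketch; via_matrix, via_vector and cols are illustrative names):

```r
A <- matrix(1:9, 3)
g <- c(1, 1, 2)

via_matrix <- split.default(A, g)                  # the matrix is treated as a vector
via_vector <- split(c(A), rep_len(g, length(A)))   # explicit recycling of g
identical(via_matrix, via_vector)

# a data frame is a list of columns, so the same call picks columns
cols <- split.default(data.frame(A), g)
lapply(cols, names)
```

The grouping vector has length 3 while the matrix has 9 elements, so each group label is reused three times down the flattened matrix.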

Why doesn't R core write another line for a "matrix", like

for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]

Until now, the only way to safely split a matrix by columns has been to transpose it, call split.data.frame, then transpose each piece back.

lapply(split.data.frame(t(A), c(1, 1, 2)), t)
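For comparison, a minimal sketch of a direct column split that avoids the two transposes. split_cols is a hypothetical helper written for this illustration, not a base R function.

```r
# hypothetical helper, not part of base R
split_cols <- function(M, f) {
  # split the column positions, then extract each block without dropping dims
  lapply(split(seq_len(ncol(M)), f), function(j) M[, j, drop = FALSE])
}

A <- matrix(1:9, 3)
by_index     <- split_cols(A, c(1, 1, 2))
by_transpose <- lapply(split.data.frame(t(A), c(1, 1, 2)), t)
all.equal(by_index, by_transpose)
```

Both approaches should give the same list of matrices; the index version just skips the copies made by t().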

Another workaround, lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix), is buggy if A is a character matrix.

How does `.Internal(split(x, f))` work?

This is really the core of the core! I will take a small example below for explanation:

set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c

x <- 0:9

Basically there are 3 steps. To enhance readability, equivalent R code is provided for each step.

step 1: tabulation (counting occurrence of each factor level)

## a factor is stored as integer codes, so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
#[1] 2 4 4
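As a quick sanity check of step 1 (a sketch; the factor is typed out by hand here so that it does not depend on the random number generator), tabulate on the integer codes gives the same counts as table:

```r
f <- factor(c("c", "a", "b", "b", "c", "a", "c", "c", "b", "b"))

tab <- tabulate(as.integer(f), nbins = nlevels(f))
tab
#[1] 2 4 4

# table() counts the same thing, but carries names and a class
all(tab == as.vector(table(f)))
```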

step 2: storage allocation of the resulting list

result <- vector("list", nlevels(f))
## typeof(x), not mode(x): mode(0:9) is "numeric", which would allocate doubles
for (i in seq_along(tab)) result[[i]] <- vector(typeof(x), tab[i])
names(result) <- levels(f)

I would annotate this list as follows, where each line is a list element (a vector in this example) and each [ ] is a placeholder for an entry of that vector.

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [ ] [ ] [ ] [ ]

step 3: element allocation

Now it is useful to uncover the internal integer codes of the factor:

.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2

We need to scan x and .f, filling x[i] into the right entry of result[[.f[i]]], informed by an accumulator buffer vector.

ab <- integer(nlevels(f))  ## accumulator buffer

for (i in seq_along(.f)) {
  fi <- .f[i]                     ## which list element x[i] belongs to
  counter <- ab[fi] + 1L          ## next free slot in that element
  result[[fi]][counter] <- x[i]
  ab[fi] <- counter
}

In the following illustration, ^ is a pointer to elements that are accessed or updated.

## i = 1

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
     ^

ab: [0] [0] [0]  ## on entry
             ^

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]
     ^

ab: [0] [0] [1]  ## on exit
             ^

## i = 2

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^

ab: [0] [0] [1]  ## on entry
     ^

$a: [1] [ ]
     ^
$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]

ab: [1] [0] [1]  ## on exit
     ^

## i = 3

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^

ab: [1] [0] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [ ] [ ] [ ]
     ^
$c: [0] [ ] [ ] [ ]

ab: [1] [1] [1]  ## on exit
         ^

## i = 4

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^

ab: [1] [1] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]
         ^
$c: [0] [ ] [ ] [ ]

ab: [1] [2] [1]  ## on exit
         ^

## i = 5

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^

ab: [1] [2] [1]  ## on entry
             ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]
         ^

ab: [1] [2] [2]  ## on exit
             ^

## i = 6

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^

ab: [1] [2] [2]  ## on entry
     ^

$a: [1] [5]
         ^
$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]

ab: [2] [2] [2]  ## on exit
     ^

## i = 7

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^

ab: [2] [2] [2]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [ ]
             ^

ab: [2] [2] [3]  ## on exit
             ^

## i = 8

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^

ab: [2] [2] [3]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [7]
                 ^

ab: [2] [2] [4]  ## on exit
             ^

## i = 9

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^

ab: [2] [2] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [ ]
             ^
$c: [0] [4] [6] [7]

ab: [2] [3] [4]  ## on exit
         ^

## i = 10

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^

ab: [2] [3] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [9]
                 ^
$c: [0] [4] [6] [7]

ab: [2] [4] [4]  ## on exit
         ^
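Putting the three steps together, here is a self-contained sketch in plain R. split_sketch is a hypothetical name; it assumes f contains no NA and has the same length as x, and it only mimics the steps described above. It is not the actual C source behind .Internal(split(x, f)).

```r
split_sketch <- function(x, f) {
  f <- as.factor(f)
  codes <- as.integer(f)
  n_lev <- nlevels(f)

  # step 1: tabulation
  tab <- tabulate(codes, nbins = n_lev)

  # step 2: storage allocation, keeping the type of x
  result <- vector("list", n_lev)
  for (k in seq_len(n_lev)) result[[k]] <- vector(typeof(x), tab[k])
  names(result) <- levels(f)

  # step 3: element allocation, guided by the accumulator buffer
  ab <- integer(n_lev)
  for (i in seq_along(codes)) {
    k <- codes[i]
    ab[k] <- ab[k] + 1L
    result[[k]][ab[k]] <- x[i]
  }
  result
}

# the factor from the walkthrough, typed out so it does not depend on the RNG
f <- factor(c("c", "a", "b", "b", "c", "a", "c", "c", "b", "b"))
x <- 0:9
identical(split_sketch(x, f), split(x, f))
```

The whole procedure makes two linear passes over f (one to count, one to fill), which is why no sorting is needed.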

Why does split() in R split a matrix into a vector, and how can I get a matrix result?

split.data.frame(x,idx) maybe? That will force the split operation to treat your matrix like a data.frame, instead of as a vector with dimensions (which essentially describes a matrix).

Example showing it gives essentially the same result, but with a matrix instead of data.frame returned:

set.seed(1)
x <- matrix(rnorm(15),5,3)
idx <- rbinom(5,1,0.5)
split.data.frame(x,idx)
#$`0`
#           [,1]       [,2]       [,3]
#[1,] -0.6264538 -0.8204684  1.5117812
#[2,] -0.8356286  0.7383247 -0.6212406
#[3,]  1.5952808  0.5757814 -2.2146999
#
#$`1`
#          [,1]       [,2]      [,3]
#[1,] 0.1836433  0.4874291 0.3898432
#[2,] 0.3295078 -0.3053884 1.1249309

split(data.frame(x),idx)
#$`0`
#          X1         X2         X3
#1 -0.6264538 -0.8204684  1.5117812
#3 -0.8356286  0.7383247 -0.6212406
#4  1.5952808  0.5757814 -2.2146999
#
#$`1`
#         X1         X2        X3
#2 0.1836433  0.4874291 0.3898432
#5 0.3295078 -0.3053884 1.1249309

R - How to split a data frame into a list of data frames with specific header combinations

You want each column to be an individual data frame?

lapply(2:ncol(df), function (j) df[c(1, j)])

The solution with split does no good here. If you want to split off every single column, the work that split does is pure overhead. Learn more about split from "What is the algorithm behind R core's `split` function?"

If you have difficulty understanding the code, do it in two steps.

# define a function
f <- function (j) df[c(1, j)]

## try the function to see what it does
f(2)
f(3)

# use a lapply loop
result <- lapply(2:ncol(df), f)
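Since df is not shown in the question, here is the same idea on made-up data (the column names id, a and b are assumptions for illustration only):

```r
# made-up data: one key column followed by two value columns
df <- data.frame(id = 1:3, a = c(10, 20, 30), b = c("x", "y", "z"))

# pair the first column with each remaining column in turn
result <- lapply(2:ncol(df), function(j) df[c(1, j)])
names(result) <- names(df)[-1]   # optional: label each piece by its value column
str(result)
```

Each list element is a two-column data frame, so no grouping factor (and hence no split) is involved.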

Splitting a data frame into N subsets with equal number of columns

Using comments by @markus on split.default, we can modify the initial code and change the sampling so that we get exactly 50 columns in each subset.

Making some dummy data,

df <- data.frame(matrix(1:250, ncol = 250))

Then splitting (we split this way because, as @markus pointed out, it is the safer, more robust version),

df2 <- lapply(split.data.frame(t(df), sample(rep(1:5, ncol(df)/5))), t)

A less robust but simpler option is:

df2 <- split.default(df, sample(rep(1:5, ncol(df)/5)))

gives us,

> ncol(df2$`1`)
[1] 50
> ncol(df2$`2`)
[1] 50
> ncol(df2$`3`)
[1] 50
> ncol(df2$`4`)
[1] 50
> ncol(df2$`5`)
[1] 50
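The same check can be written compactly. This is a sketch with an arbitrary seed, added only so that the run is reproducible; grp is an illustrative name.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(matrix(1:250, ncol = 250))

# shuffling rep(1:5, 50) guarantees each label appears exactly 50 times
grp <- sample(rep(1:5, ncol(df) / 5))
df2 <- split.default(df, grp)

vapply(df2, ncol, integer(1))   # number of columns in each subset
```

Sampling a permutation of a balanced label vector, rather than sampling labels with replacement, is what makes the subset sizes exact.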

Splitting a dataframe according to a sequence

The main thing you need to do here is use split.default instead of split, as the data.frame method for split will split by rows instead of columns. The following algorithm will produce a data frame where each column is the average of columns n, n + m, n + 2 * m, ..., n + (k - 1) * m, where in your case m is 365, k is 22, and n belongs to 1:365.

df.split <- split.default(df, rep(1:m, ncol(df) / m))
as.data.frame(lapply(df.split, apply, 1, mean, na.rm = TRUE))

This assumes the number of columns in your data frame is a multiple of m. In your case m is 365, and your data frame satisfies that (8030 = 22 * 365). And here is some data I made up to test it:

set.seed(1)
m <- 5 # 365 in your case
k <- 3 # 22 in your case (8030 / 365)
df <- as.data.frame(replicate(k * m, sample(1:100, 10), simplify = FALSE))
names(df) <- paste0("V", 1:(k * m))
df[[1]][[5]] <- NA
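Running the two steps on that test data, with a by-hand check of the first group (a sketch; res and by_hand are illustrative names, and rowMeans is used only as an independent cross-check):

```r
set.seed(1)
m <- 5; k <- 3                      # small stand-ins for 365 and 22
df <- as.data.frame(replicate(k * m, sample(1:100, 10), simplify = FALSE))
names(df) <- paste0("V", 1:(k * m))
df[[1]][[5]] <- NA

df.split <- split.default(df, rep(1:m, ncol(df) / m))
res <- as.data.frame(lapply(df.split, apply, 1, mean, na.rm = TRUE))

# group 1 should hold columns 1, 1 + m and 1 + 2 * m
names(df.split[[1]])
by_hand <- rowMeans(df[c(1, 1 + m, 1 + 2 * m)], na.rm = TRUE)
all.equal(as.numeric(res[[1]]), as.numeric(by_hand))
```

The NA planted in row 5 of the first column is skipped by na.rm = TRUE, so that row's average for group 1 uses only the two remaining columns.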

split matrix in R by column name

It depends a bit what exactly you want to do. Here are a few examples:

mat <- structure(c(3L, 4L, 3L, 4L, 3L, 4L, 3L, 2L, 3L, 2L, 3L, 2L), 
                 .Dim = c(2L,6L), 
                 .Dimnames = list(c("2", "4"), c("c_1", "c_2", "A_1", "A_2","D_1", "D_2")))

If you just want to extract some columns manually, you can use

mat[,1:2]
mat[,3:4]
mat[,5:6]

In case you want to do this depending on the first letter of the column name, you can manually choose which prefix you want:

mat[,substr(colnames(mat), 1, 1)=="A"]

or you can get a list with one sub-matrix for every prefix that occurs in the column names

lst <- lapply(unique(substr(colnames(mat),1,1)), 
          function(x) mat[,substr(colnames(mat), 1, 1)==x])
names(lst) <- unique(substr(colnames(mat),1,1))
lst
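An alternative sketch that reaches a similar list through split on the column positions. The names prefix, grp and lst2 are illustrative; the explicit factor levels are an assumption made here to keep the groups in first-appearance order rather than in locale-dependent sorted order.

```r
mat <- matrix(c(3L, 4L, 3L, 4L, 3L, 4L, 3L, 2L, 3L, 2L, 3L, 2L), nrow = 2,
              dimnames = list(c("2", "4"),
                              c("c_1", "c_2", "A_1", "A_2", "D_1", "D_2")))

prefix <- substr(colnames(mat), 1, 1)
grp <- factor(prefix, levels = unique(prefix))   # keep first-appearance order

# split the column positions by prefix, then extract each block as a matrix
lst2 <- lapply(split(seq_len(ncol(mat)), grp),
               function(j) mat[, j, drop = FALSE])
names(lst2)
lst2$A
```

With drop = FALSE, a prefix that matches only one column would still come back as a one-column matrix instead of a vector.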

Group dataframe by using a row in r

Check this solution:

library(tidyverse)

df %>%
  t() %>%
  as_tibble() %>%
  split(.$V1) %>%
  map(t)

Split into list of data frames by column index

That would be the default method of split:

out <- split.default(x, indx)
identical(ls, setNames(out, NULL))
#[1] TRUE
