What is the algorithm behind R core's `split` function?
How does split.data.frame work?
function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...),
function(ind) x[ind, , drop = FALSE])
It calls split.default to split the row index vector seq_len(nrow(x)), then uses an lapply loop to extract the associated rows into a list entry.
This isn't strictly a "data.frame" method. It splits any 2-dimensional object by its 1st dimension, including splitting a matrix by rows.
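As a minimal sketch of that two-step logic (the data frame `df`, the matrix `M` and their contents are made up for illustration), the row indices can be split first and the rows extracted afterwards:

```r
# Made-up data for illustration
df <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "b", "a"))

# Step 1: split the row indices by the grouping variable
ind <- split(seq_len(nrow(df)), df$g)

# Step 2: extract the matching rows for each group
manual <- lapply(ind, function(i) df[i, , drop = FALSE])

# The manual result should match what split() returns for a data frame
identical(manual, split(df, df$g))

# A matrix is split by rows in the same way
M <- matrix(1:12, nrow = 6)
by_row <- split.data.frame(M, df$g)
```

Here `by_row$a` holds rows 1, 3 and 6 of `M`, still as a matrix.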
How does split.default work?
function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
{
if (!missing(...))
.NotYetUsed(deparse(...), error = FALSE)
if (is.list(f))
f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f))
f <- as.factor(f)
else if (drop)
f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class")))
return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}
- if x has no classes (i.e., mostly an atomic vector), .Internal(split(x, f)) is used;
- otherwise, it uses .Internal(split()) to split the index along x, then uses a for loop to extract associated elements into a list entry.
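The second branch can be reproduced by hand. The sketch below uses a made-up factor `x` and grouping `g`; it only mirrors the index-then-subset logic in R code and is not the internal implementation:

```r
# A factor has a "class" attribute, so it takes the second branch
x <- factor(c("u", "v", "w", "u", "v", "w"))
g <- c(1, 2, 1, 2, 1, 2)

# Split the positions along x, then subset x with "["
ind <- split(seq_along(x), g)
manual <- lapply(ind, function(i) x[i])

# Each piece is still a factor that keeps the original levels
identical(manual, split(x, g))
levels(manual[["1"]])
```

Because the elements are extracted with "[", whatever the class's subsetting method preserves (here, the levels) is carried into every list entry.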
A vector without a class (see ?vector) has one of the following modes, of which only the first group is atomic:
- "logical", "integer", "numeric", "complex", "character" and "raw"
- "list"
- "expression"
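Objects with a class attribute are too numerous to list. Two common examples:
- "factor"
- "data.frame"
A matrix reports class "matrix" but carries no "class" attribute (only "dim"), so split.default sends it down the first branch as a plain vector.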
In my opinion split.default is not well written. There are so many objects with classes, yet split.default deals with them all in the same way via "[". This works fine with "factor" and "data.frame" (so we will be splitting a data frame along its columns!), but it definitely does not handle a matrix the way we expect.
A <- matrix(1:9, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
split.default(A, c(1, 1, 2)) ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9
What actually happens is that the recycling rule is applied to c(1, 1, 2), so we are equivalently doing:
split(c(A), rep_len(c(1,1,2), length(A)))
Why doesn't R core write another line for a "matrix", like
for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]
Until then, the only safe way to split a matrix by columns is to transpose it, apply split.data.frame, then transpose each piece back.
lapply(split.data.frame(t(A), c(1, 1, 2)), t)
Another workaround via lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix) is buggy if A is a character matrix.
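The extra line suggested above for matrices can also be wrapped in a small helper. `split_cols` is a hypothetical name, not a base R function; this is a sketch that splits the column indices and subsets with drop = FALSE, so no transpose is needed:

```r
# Hypothetical helper: split a matrix by columns without transposing
split_cols <- function(x, f) {
  ind <- split(seq_len(ncol(x)), f)              # split the column indices
  lapply(ind, function(j) x[, j, drop = FALSE])  # keep each piece a matrix
}

A <- matrix(1:9, 3)
res <- split_cols(A, c(1, 1, 2))
res

# Same pieces as the transpose / split.data.frame / transpose route
identical(res, lapply(split.data.frame(t(A), c(1, 1, 2)), t))
```

Since the matrix is never converted to a data frame, the element type is untouched, which also makes the helper usable on a character matrix.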
How does .Internal(split(x, f)) work?
This is really the core of the core! I will take a small example below for explanation:
set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c
x <- 0:9
Basically there are 3 steps. To enhance readability, equivalent R code is provided for each step.
step 1: tabulation (counting occurrence of each factor level)
## a factor has integer mode so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
#[1] 2 4 4
step 2: storage allocation of the resulting list
result <- vector("list", nlevels(f))
## typeof(), not mode(): mode(0:9) is "numeric", which would allocate doubles
for (i in seq_along(tab)) result[[i]] <- vector(typeof(x), tab[i])
names(result) <- levels(f)
I would annotate this list as follows, where each line is a list element (a vector in this example), and each [ ] is a placeholder for an entry of that vector.
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [ ] [ ] [ ] [ ]
step 3: element allocation
Now it is useful to uncover the internal integer mode for a factor:
.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2
We need to scan x and .f, filling x[i] into the right entry of result[[.f[i]]], informed by an accumulator buffer vector.
ab <- integer(nlevels(f)) ## accumulator buffer
for (i in 1:length(.f)) {
fi <- .f[i]
counter <- ab[fi] + 1L
result[[fi]][counter] <- x[i]
ab[fi] <- counter
}
In the following illustration, ^ is a pointer to elements that are accessed or updated.
## i = 1
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
     ^
ab: [0] [0] [0]  ## on entry
             ^
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
     ^
ab: [0] [0] [1]  ## on exit
             ^
## i = 2
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^
ab: [0] [0] [1]  ## on entry
     ^
$a: [1] [ ]
     ^
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
ab: [1] [0] [1]  ## on exit
     ^
## i = 3
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^
ab: [1] [0] [1]  ## on entry
         ^
$a: [1] [ ]
$b: [2] [ ] [ ] [ ]
     ^
$c: [0] [ ] [ ] [ ]
ab: [1] [1] [1]  ## on exit
         ^
## i = 4
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^
ab: [1] [1] [1]  ## on entry
         ^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
         ^
$c: [0] [ ] [ ] [ ]
ab: [1] [2] [1]  ## on exit
         ^
## i = 5
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^
ab: [1] [2] [1]  ## on entry
             ^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
         ^
ab: [1] [2] [2]  ## on exit
             ^
## i = 6
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^
ab: [1] [2] [2]  ## on entry
     ^
$a: [1] [5]
         ^
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
ab: [2] [2] [2]  ## on exit
     ^
## i = 7
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^
ab: [2] [2] [2]  ## on entry
             ^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [ ]
             ^
ab: [2] [2] [3]  ## on exit
             ^
## i = 8
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^
ab: [2] [2] [3]  ## on entry
             ^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [7]
                 ^
ab: [2] [2] [4]  ## on exit
             ^
## i = 9
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^
ab: [2] [2] [4]  ## on entry
         ^
$a: [1] [5]
$b: [2] [3] [8] [ ]
             ^
$c: [0] [4] [6] [7]
ab: [2] [3] [4]  ## on exit
         ^
## i = 10
x:  [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^
ab: [2] [3] [4]  ## on entry
         ^
$a: [1] [5]
$b: [2] [3] [8] [9]
                 ^
$c: [0] [4] [6] [7]
ab: [2] [4] [4]  ## on exit
         ^
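Putting the three steps together gives a small R function that can be checked against split(). `split_sketch` is a hypothetical name; it is only an R-level sketch of the steps described above, not the C code behind .Internal(split()), and it assumes f contains no NA. The factor is written out explicitly rather than sampled, because the values returned by sample() for a given seed depend on the R version.

```r
# Hypothetical R-level sketch of the three steps; assumes no NA in f
split_sketch <- function(x, f) {
  f <- as.factor(f)
  # step 1: tabulation
  tab <- tabulate(f, nbins = nlevels(f))
  # step 2: storage allocation (typeof() keeps integer input as integer)
  result <- vector("list", nlevels(f))
  for (i in seq_along(tab)) result[[i]] <- vector(typeof(x), tab[i])
  names(result) <- levels(f)
  # step 3: element allocation, driven by the accumulator buffer
  .f <- as.integer(f)
  ab <- integer(nlevels(f))
  for (i in seq_along(.f)) {
    fi <- .f[i]
    counter <- ab[fi] + 1L
    result[[fi]][counter] <- x[i]
    ab[fi] <- counter
  }
  result
}

# Same x and f as in the illustration
f <- factor(c("c", "a", "b", "b", "c", "a", "c", "c", "b", "b"))
x <- 0:9
res <- split_sketch(x, f)
identical(res, split(x, f))
```

The final state matches the last frame of the illustration: $a holds 1 and 5, $b holds 2, 3, 8 and 9, and $c holds 0, 4, 6 and 7.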
Why split() in R split matrix into vector and how can I get the matrix result?
split.data.frame(x, idx) maybe? That will force the split operation to treat your matrix like a data.frame, instead of as a vector with dimensions (which is essentially what a matrix is).
Example showing that it gives essentially the same result, but with a matrix instead of a data.frame returned:
set.seed(1)
x <- matrix(rnorm(15),5,3)
idx <- rbinom(5,1,0.5)
split.data.frame(x,idx)
#$`0`
# [,1] [,2] [,3]
#[1,] -0.6264538 -0.8204684 1.5117812
#[2,] -0.8356286 0.7383247 -0.6212406
#[3,] 1.5952808 0.5757814 -2.2146999
#
#$`1`
# [,1] [,2] [,3]
#[1,] 0.1836433 0.4874291 0.3898432
#[2,] 0.3295078 -0.3053884 1.1249309
split(data.frame(x),idx)
#$`0`
# X1 X2 X3
#1 -0.6264538 -0.8204684 1.5117812
#3 -0.8356286 0.7383247 -0.6212406
#4 1.5952808 0.5757814 -2.2146999
#
#$`1`
# X1 X2 X3
#2 0.1836433 0.4874291 0.3898432
#5 0.3295078 -0.3053884 1.1249309
R - How to split a data frame into a list of data frames with specific header combinations
You want each column to be an individual data frame?
lapply(2:ncol(df), function (j) df[c(1, j)])
The solution with split does no good here. If you want to split off every single column, the grouping work that split does is just overhead. Learn more about split from What is the algorithm behind R core's `split` function?
If you have difficulty understanding the code, do it in two steps.
# define a function
f <- function (j) df[c(1, j)]
# try the function to see what it does
f(2)
f(3)
# use a lapply loop
result <- lapply(2:ncol(df), f)
Splitting a data frame into N subsets with equal number of columns
Following comments by @markus, we can modify the initial code to use split.default, and change the sampling so that we get exactly 50 columns in each subset.
Making some dummy data:
df <- data.frame(matrix(1:250, ncol = 250))
Then splitting. As pointed out by @markus, this is the safer, more robust version:
df2 <- lapply(split.data.frame(t(df), sample(rep(1:5, ncol(df)/5))), t)
A less robust, but simpler option is:
df2 <- split.default(df, sample(rep(1:5, ncol(df)/5)))
This gives us:
> ncol(df2$`1`)
[1] 50
> ncol(df2$`2`)
[1] 50
> ncol(df2$`3`)
[1] 50
> ncol(df2$`4`)
[1] 50
> ncol(df2$`5`)
[1] 50
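The five checks above can be collapsed into one call. A small sketch, reusing the dummy data from this answer; the seed value is arbitrary and only there to make the shuffle reproducible:

```r
set.seed(42)  # arbitrary seed, only to make the shuffle reproducible
df <- data.frame(matrix(1:250, ncol = 250))

# Balanced labels: each of 1..5 appears ncol(df) / 5 times, in random order
g <- sample(rep(1:5, ncol(df) / 5))
df2 <- split.default(df, g)

# Column count of every subset in one call; every entry is 50
counts <- vapply(df2, ncol, integer(1))
counts
```

Shuffling a balanced label vector, rather than sampling labels with replacement, is what guarantees equal subset sizes.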
Splitting a dataframe according to a sequence
The main thing you need to do here is use split.default instead of split, as the data.frame method for split will split by rows instead of columns. The following algorithm will produce a data frame where each column is the average of columns n, n + m, n + 2 * m, ..., n + (k - 1) * m, where in your case m is 365, k is 22, and n belongs to 1:365.
df.split <- split.default(df, rep(1:m, ncol(df) / m))
as.data.frame(lapply(df.split, apply, 1, mean, na.rm=T))
This assumes your data frame has a multiple of m columns. In your case m is 365, and your data frame does have a multiple of that. And here is some data I made up to test it:
set.seed(1)
m <- 5 # 365 in your case
k <- 3 # 22 in your case (8030 / 365)
df <- as.data.frame(replicate(k * m, sample(1:100, 10), simplify=F))
names(df) <- paste0("V", 1:(k * m))
df[[1]][[5]] <- NA
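Run end to end on the made-up data, the grouping can be inspected before averaging. This sketch swaps in rowMeans() for apply(., 1, mean), which is an alternative to the code above rather than part of the original answer:

```r
set.seed(1)
m <- 5
k <- 3
df <- as.data.frame(replicate(k * m, sample(1:100, 10), simplify = FALSE))
names(df) <- paste0("V", 1:(k * m))
df[[1]][[5]] <- NA

# Columns V1, V6, V11 go to group 1; V2, V7, V12 to group 2; and so on
df.split <- split.default(df, rep(1:m, ncol(df) / m))
names(df.split[["1"]])

# Row-wise mean within each group, ignoring NA
res <- as.data.frame(lapply(df.split, rowMeans, na.rm = TRUE))
dim(res)  # 10 rows, m columns
```

With na.rm = TRUE, the fifth value of the first result column is the mean of V6 and V11 only, since V1 is NA in that row.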
split matrix in R by column name
It depends a bit what exactly you want to do. Here are a few examples:
mat <- structure(c(3L, 4L, 3L, 4L, 3L, 4L, 3L, 2L, 3L, 2L, 3L, 2L),
.Dim = c(2L,6L),
.Dimnames = list(c("2", "4"), c("c_1", "c_2", "A_1", "A_2","D_1", "D_2")))
If you just want to extract some columns manually, you can use
mat[,1:2]
mat[,3:4]
mat[,5:6]
In case you want to do this depending on the first letter of the column name, you can manually choose which column names you want:
mat[,substr(colnames(mat), 1, 1)=="A"]
or you can get a list with one matrix for each distinct first letter:
lst <- lapply(unique(substr(colnames(mat),1,1)),
function(x) mat[,substr(colnames(mat), 1, 1)==x])
names(lst) <- unique(substr(colnames(mat),1,1))
lst
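The same list can also be built by splitting the column indices. A sketch under the same `mat` (rebuilt here with matrix() so the block is self-contained); `prefix` and `ind` are names introduced for illustration:

```r
mat <- matrix(c(3L, 4L, 3L, 4L, 3L, 4L, 3L, 2L, 3L, 2L, 3L, 2L), nrow = 2,
              dimnames = list(c("2", "4"),
                              c("c_1", "c_2", "A_1", "A_2", "D_1", "D_2")))

# Group the column indices by the first letter of the column name
prefix <- substr(colnames(mat), 1, 1)
ind <- split(seq_len(ncol(mat)), prefix)

# drop = FALSE keeps a one-column group as a matrix
lst <- lapply(ind, function(j) mat[, j, drop = FALSE])
lst[["A"]]
```

Note that split() orders the groups by sorted prefix rather than by first appearance, so the list order may differ from the unique()-based version.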
Group dataframe by using a row in r
Check this solution:
library(tidyverse)
df %>%
t() %>%
as_tibble() %>%
split(.$V1) %>%
map(t)
Split into list of data frames by column index
It would be the default method of split:
out <- split.default(x, indx)
identical(ls, setNames(out, NULL))
#[1] TRUE