Why is `[` better than `subset`?

This question was answered well in the comments by @James, who pointed to an excellent explanation by Hadley Wickham of the dangers of subset (and functions like it) [here]. Go read it!

It's a somewhat long read, so it may be helpful to record here the example Hadley uses that most directly addresses the question of "what can go wrong?"

Suppose we want to subset and then reorder a data frame using the following functions:

scramble <- function(x) x[sample(nrow(x)), ]

subscramble <- function(x, condition) {
  scramble(subset(x, condition))
}

subscramble(mtcars, cyl == 4)

This returns the error:

Error in eval(expr, envir, enclos) : object 'cyl' not found

because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:

cyl <- 4
subscramble(mtcars, cyl == 4)

cyl <- sample(10, 100, replace = TRUE)
subscramble(mtcars, cyl == 4)

(Run them and see for yourself, it's pretty crazy.)
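
For contrast, here is a minimal sketch of the same idea written with `[`. It is not taken from Hadley's write-up, and the name subscramble2 and its rows argument are made up for illustration. Because the caller builds the logical vector explicitly from the data frame, nothing depends on where R happens to find a bare name like cyl:

```r
scramble <- function(x) x[sample(nrow(x)), ]

# Hypothetical variant: accept an already-evaluated logical (or index) vector
# instead of an unevaluated condition.
subscramble2 <- function(x, rows) {
  scramble(x[rows, , drop = FALSE])
}

cyl <- 4  # a stray global with the same name no longer matters
res <- subscramble2(mtcars, mtcars$cyl == 4)
table(res$cyl)
```

The price is that the data frame name is typed twice, which is exactly the repetition that subset() was designed to save in interactive use.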

When should I use which for subsetting?

Since this question is specifically about subsetting, I thought I would
illustrate some of the performance benefits of using which() over a
logical subset brought up in the linked question.

When you want to extract the entire subset, there is not much difference in
processing speed, but which() allocates less memory. However, if you only
want part of the subset (e.g. to showcase some strange findings), which()
has a significant speed and memory advantage, because you can subset the
result of which() rather than subsetting the data frame twice.
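
As a small sketch of the two approaches (the toy data frame and its column names are invented here for illustration; they are not the benchmark data), the difference is only in what gets subset twice:

```r
# Toy data frame standing in for a large one (names are illustrative).
toy <- data.frame(id = 1:8, price = c(5, 12, 7, 30, 2, 18, 25, 1))
mu <- mean(toy$price)
i <- seq_len(2)

# Subsets the data frame twice: once by condition, once by position.
twice <- toy[toy$price > mu, ][i, ]

# Subsets the integer index first, then the data frame only once.
once <- toy[which(toy$price > mu)[i], ]

all.equal(twice, once)
```

Both forms should pick the same rows; the second simply avoids building the intermediate data frame.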

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940 10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).

Subsetting in R vs filter(from dplyr) giving different results

If there are NAs, make sure to account for them with is.na(); otherwise filter() will, by default, remove those rows:

filter(house2, (datetime >= "2007-02-01 00:00:00" &
                datetime <= "2007-02-03 00:00:00") |
               is.na(datetime))

According to ?filter

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
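
A minimal sketch of the difference, using a made-up four-row data frame (d, id and x are illustrative names only). The dplyr calls are guarded so the base R part runs even where dplyr is not installed:

```r
d <- data.frame(id = 1:4, x = c(1, NA, 3, 4))

# Base `[` keeps a row for the NA condition (filled with NAs).
base_rows <- d[d$x > 2, ]

# which() and subset() both drop the row where the condition is NA.
which_rows <- d[which(d$x > 2), ]
subset_rows <- subset(d, x > 2)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # filter() drops the NA row unless it is asked for explicitly.
  dplyr_rows <- dplyr::filter(d, x > 2)
  dplyr_keep_na <- dplyr::filter(d, x > 2 | is.na(x))
  print(nrow(dplyr_rows))
  print(nrow(dplyr_keep_na))
}

nrow(base_rows)
nrow(which_rows)
nrow(subset_rows)
```

Comparing the row counts is usually the quickest way to spot that NAs are behind a mismatch between the two approaches.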

Difference between subset and filter from dplyr

They are, indeed, producing the same result, and they are very similar in concept.

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At about 15,000 records, filter outpaces subset by about 300 microseconds. And at 153,000 records, filter is about three times faster (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.


library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  subset = subset(airquality, Temp > 80 & Month > 5),
  filter = filter(airquality, Temp > 80 & Month > 5)
)

Unit: microseconds
   expr     min       lq     mean   median      uq      max neval cld
 subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a
 filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows()

microbenchmark(
  subset = subset(air, Temp > 80 & Month > 5),
  filter = filter(air, Temp > 80 & Month > 5)
)

Unit: microseconds
   expr      min        lq     mean   median       uq      max neval cld
 subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
 filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows()

microbenchmark(
  subset = subset(air, Temp > 80 & Month > 5),
  filter = filter(air, Temp > 80 & Month > 5)
)

Unit: milliseconds
   expr       min        lq     mean    median        uq      max neval cld
 subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
 filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a
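
To check the claim that the two calls return the same rows, a quick sketch along these lines may help (base_res and dplyr_res are names chosen here; the dplyr part is guarded in case the package is not installed):

```r
base_res <- subset(airquality, Temp > 80 & Month > 5)

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_res <- dplyr::filter(airquality, Temp > 80 & Month > 5)
  # Compare values only: subset() keeps the original row names,
  # while filter() renumbers the rows from 1.
  print(all.equal(base_res, dplyr_res, check.attributes = FALSE))
}

head(base_res)
```

The row-name difference is worth knowing about if later code indexes rows by name.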

In R subsetting without using subset() and use [ in a more concise manner to prevent typos?

After some thought, I wrote a super simple function called given:

given=function(.,...) { with(.,...) }

This way, I don't have to repeat the name of the data.frame. I also found it to be 14 times faster than filter(). See below:

library(microbenchmark)
adf=data.frame(a=1:10,b=11:20)
given=function(.,...) { with(.,...) }
with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)

Using microbenchmark

> adf=data.frame(a=1:10,b=11:20)
> given=function(.,...) { with(.,...) }
> with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
  a  b
6 6 16
7 7 17
> given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
  a  b
6 6 16
7 7 17
> dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
  a  b
1 6 16
2 7 17
> microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
Unit: microseconds
                             expr    min     lq     mean median     uq     max neval
 with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
> microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
Unit: microseconds
                            expr    min     lq     mean median    uq     max neval
 given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
> microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Unit: microseconds
                              expr     min       lq     mean   median       uq      max neval
 dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

In these timings given() came out a tad faster than with(). I put that down to the length of the variable name, but since given() just calls with(), the gap may simply be benchmark noise.

The neat thing about given is that you can do some things inline without assignment:
given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])

Why does lm() with the subset argument give a different answer than subsetting in advance?

tl;dr As suggested in other comments and answers, the characteristics of the orthogonal polynomial basis are computed before the subsetting is taken into account.

To add more technical detail to @JonManes's answer, let's look at lines 545-553 of the R code where 'model.frame' is defined.

First we have (lines 545-549)

if(is.null(attr(formula, "predvars"))) {
    for (i in seq_along(varnames))
        predvars[[i+1L]] <- makepredictcall(variables[[i]], vars[[i+1L]])
    attr(formula, "predvars") <- predvars
}
  • At this point in the code, formula will not be an actual formula (that would be too easy!), but rather a terms object that contains various useful-to-developers info about model structures ...
  • predvars is the attribute that defines the information needed to properly reconstruct data-dependent bases like orthogonal polynomials and splines (see ?makepredictcall for a little bit more information, or here, although in general this stuff is really poorly documented; I'd expect it to be documented here but it isn't ...). For example,
attr(terms(model.frame(mpg ~ poly(horsepower, 2), data = auto_train)),  "predvars")


list(mpg, poly(horsepower, 2, coefs = list(alpha = c(102.612244897959, 
142.498828460405), norm2 = c(1, 196, 277254.530612245, 625100662.205702))))

These are the coefficients for the polynomial, which depend on the distribution of the input data.

Only after this information has been established, on line 553, do we get

subset <- eval(substitute(subset), data, env)

In other words, the subsetting argument doesn't even get evaluated until after the polynomial characteristics are determined (all of this information is then passed to the internal C_modelframe function, which you really don't want to look at ...)

Note that this issue does not result in an information leak between training and testing sets in a statistical learning context: the parameterization of the polynomial doesn't affect the predictions of the model at all (in theory, although as usual with floating point the results are unlikely to be exactly identical). At worst (if the training and full sets were very different) it could reduce numerical stability a bit.
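
A minimal sketch of both points, using mtcars and an arbitrary split since the question's auto_train data is not reproduced here (train, fit_arg and fit_pre are names chosen for illustration):

```r
# Hypothetical split of mtcars; any logical vector would do.
train <- mtcars$am == 1

fit_arg <- lm(mpg ~ poly(hp, 2), data = mtcars, subset = train)
fit_pre <- lm(mpg ~ poly(hp, 2), data = mtcars[train, ])

# The stored basis coefficients differ: full data vs. training rows only.
attr(terms(fit_arg), "predvars")
attr(terms(fit_pre), "predvars")

# The regression coefficients differ too ...
coef(fit_arg)
coef(fit_pre)

# ... but the fitted values should agree up to floating-point error.
all.equal(fitted(fit_arg), fitted(fit_pre))
```

Both fits span the same space (intercept, hp, hp^2), which is why the coefficients can change while the predictions do not.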

FWIW this is all surprising (to me) and seems worth raising on the r-devel@r-project.org mailing list (at least a note in the documentation seems in order).

