Difference Between [] and $ Operators for Subsetting

Difference between [] and $ operators for subsetting

Below we use a one-row data frame to keep the output brief:

mtcars1 <- mtcars[1, ]

Note the differences among these. We can use class as in class(mtcars["hp"]) to investigate the class of the return value.

The first two correspond to the code in the question and return a data frame and plain vector respectively. The key differences between [ and $ are that [ (1) can specify multiple columns, (2) allows passing of a variable as the index and (3) returns a data frame (although see examples later on) whereas $ (1) can only specify a single column, (2) the index must be hard coded and (3) it returns a vector.

mtcars1["hp"]  # returns data frame
##            hp
## Mazda RX4 110

mtcars1$hp # returns plain vector
## [1] 110
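
As a quick check of the two return types above, class() and is.data.frame() can be applied to each result. This is only a small sketch using the same one-row data frame:

```r
# One-row data frame, as above
mtcars1 <- mtcars[1, ]

class(mtcars1["hp"])          # "data.frame"
class(mtcars1$hp)             # "numeric"

# is.data.frame() gives the same answer as a logical
is.data.frame(mtcars1["hp"])  # TRUE
is.data.frame(mtcars1$hp)     # FALSE
```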

Here are other examples where the index is a single element. Note that the first and second examples below are actually the same, as drop = TRUE is the default.

mtcars1[, "hp"] # returns plain vector
## [1] 110  

mtcars1[, "hp", drop = TRUE] # returns plain vector
## [1] 110

mtcars1[, "hp", drop = FALSE] # returns data frame
##            hp
## Mazda RX4 110

Also there is the [[ operator which is like the $ operator except it can accept a variable as the index whereas $ requires the index to be hard coded:

mtcars1[["hp"]] # returns plain vector
## [1] 110
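
To make the "variable as the index" point concrete, here is a minimal sketch. The variable name col is just a placeholder I made up for illustration:

```r
mtcars1 <- mtcars[1, ]
col <- "hp"      # hypothetical variable holding a column name

mtcars1[[col]]   # 110, same as mtcars1[["hp"]]
mtcars1[col]     # one-column data frame, same as mtcars1["hp"]
mtcars1$col      # NULL: $ looks for a column literally named "col"
```

Note that $ does not raise an error here; it silently returns NULL, which is an easy mistake to miss.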

Here are others where the index specifies multiple elements. $ and [[ cannot be used with multiple elements, so these examples only use [:

mtcars1[c("mpg", "hp")] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp")] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp"), drop = FALSE] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp"), drop = TRUE] # returns list
## $mpg
## [1] 21
## 
## $hp
## [1] 110

[

mtcars[foo] can return more than one column if foo is a vector with more than one element, e.g. mtcars[c("hp", "mpg")], and in all cases the return value is a data.frame even if foo has only one element (as it does in the question).

There is also mtcars[, foo, drop = FALSE] which returns the same value as mtcars[foo] so it always returns a data frame. With drop = TRUE it will return a list rather than a data.frame in the case that foo specifies multiple columns and returns the column itself if it specifies a single column.

[[

On the other hand mtcars[[foo]] only works if foo has one element and it returns that column, not a data frame.

$

mtcars$hp also only works for a single column, like [[, and returns the column, not a data frame containing that column.

mtcars$hp is like mtcars[["hp"]]; however, there is no possibility to pass a variable index with $. One can only hard-code the index with $.

subset

Note that this works:

subset(mtcars, hp > 150)

returning a data frame containing those rows where the hp column exceeds 150:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
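
For comparison, the same rows can be selected with [ and a logical index. This is a sketch of the equivalence for this particular case, where hp has no missing values; if the condition evaluated to NA for some rows, subset would drop those rows while [ would return them as rows of NA.

```r
by_subset  <- subset(mtcars, hp > 150)
by_bracket <- mtcars[mtcars$hp > 150, ]

nrow(by_subset)                                       # 13
identical(rownames(by_subset), rownames(by_bracket))  # TRUE
```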

other objects

The above pertain to data frames but other objects that can use $, [ and [[ will have their own rules. In particular if m is a matrix, e.g. m <- as.matrix(BOD), then m[, 1] is a vector, not a one column matrix, but m[, 1, drop = FALSE] is a one column matrix. m[[1]] and m[1] are both the first element of m, not the first column. m$a does not work at all.
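
The matrix rules just described can be checked as follows. tryCatch is used here only so that the failing m$a call does not stop the script:

```r
m <- as.matrix(BOD)   # 6 x 2 numeric matrix with columns Time and demand

is.null(dim(m[, 1]))        # TRUE: plain vector, dimensions dropped
dim(m[, 1, drop = FALSE])   # 6 1: still a one-column matrix
m[[1]]                      # 1, the first element
m[1]                        # 1, also the first element

# $ is not defined for atomic vectors, so this raises an error
res <- tryCatch(m$a, error = function(e) "error")
res                         # "error"
```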

help

See ?Extract for more information. ?"$", ?"[" and ?"[[" all lead to the same page.

Difference between subset and filter from dplyr

They are, indeed, producing the same result, and they are very similar in concept.

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (roughly 5 times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 200 microseconds at the median. And at 153,000 records, filter is roughly two and a half times faster (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  df1<-subset(airquality, Temp>80 & Month > 5),
  df2<-filter(airquality, Temp>80 & Month > 5)
)

Unit: microseconds
   expr     min       lq     mean   median      uq      max neval cld
 subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a 
 filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows

microbenchmark(
  df1<-subset(air, Temp>80 & Month > 5),
  df2<-filter(air, Temp>80 & Month > 5)
)

Unit: microseconds
   expr      min        lq     mean   median       uq      max neval cld
 subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
 filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a 

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows

microbenchmark(
  df1<-subset(air, Temp>80 & Month > 5),
  df2<-filter(air, Temp>80 & Month > 5)
)

Unit: milliseconds
   expr       min        lq     mean    median        uq      max neval cld
 subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
 filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a
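
To check the "same result" claim rather than the timings, something like the following can be used. It is a sketch that assumes dplyr is installed; the comparison ignores attributes because the two functions may handle row names differently.

```r
by_subset <- subset(airquality, Temp > 80 & Month > 5)
by_filter <- dplyr::filter(airquality, Temp > 80 & Month > 5)

# Same number of rows from both approaches
nrow(by_subset) == nrow(by_filter)

# Same cell values; attributes such as row names are ignored in this check
isTRUE(all.equal(by_subset, by_filter, check.attributes = FALSE))
```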

Using multiple criteria in subset function and logical operators

The correct operator is %in% here. Here is an example with dummy data:

set.seed(1)
dat <- data.frame(bf11 = sample(4, 10, replace = TRUE),
                  foo = runif(10))

giving:

> head(dat)
  bf11       foo
1    2 0.2059746
2    2 0.1765568
3    3 0.6870228
4    4 0.3841037
5    1 0.7698414
6    4 0.4976992

The subset of dat where bf11 equals any of the set 1,2,3 is taken as follows using %in%:

> subset(dat, subset = bf11 %in% c(1,2,3))
   bf11       foo
1     2 0.2059746
2     2 0.1765568
3     3 0.6870228
5     1 0.7698414
8     3 0.9919061
9     3 0.3800352
10    1 0.7774452

As to why your original didn't work, break it down to see the problem. Look at what 1||2||3 evaluates to:

> 1 || 2 || 3
[1] TRUE

and you'd get the same using | instead. As a result, the subset() call would only return rows where bf11 was TRUE (or something that evaluated to TRUE).

What you could have written would have been something like:

subset(dat, subset = bf11 == 1 | bf11 == 2 | bf11 == 3)

This gives the same result as my earlier subset() call. The point is that you need a series of single comparisons, not a comparison of a series of options. But as you can see, %in% is far more useful and less verbose in such circumstances. Notice also that I have to use | as I want to compare each element of bf11 against 1, 2, and 3, in turn. Compare (the || output is from an older R; since R 4.3.0, || raises an error when an operand has length greater than one):

> with(dat, bf11 == 1 || bf11 == 2)
[1] TRUE
> with(dat, bf11 == 1 | bf11 == 2)
 [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
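
One further difference between %in% and a chain of == comparisons shows up with missing values. The vector x below is made up for illustration:

```r
x <- c(2, 4, NA, 1)   # hypothetical data with a missing value

x %in% c(1, 2, 3)          # TRUE FALSE FALSE  TRUE
x == 1 | x == 2 | x == 3   # TRUE FALSE    NA  TRUE
```

Inside subset() the two forms still select the same rows, since subset treats an NA condition as FALSE, but with [ the NA would produce a row of NAs.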

The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe

The R Language Definition is handy for answering these types of questions:

http://cran.r-project.org/doc/manuals/R-lang.html#Indexing

R has three basic indexing operators, with syntax displayed by the following examples
    x[i]
    x[i, j]
    x[[i]]
    x[[i, j]]
    x$a
    x$"a"
For vectors and matrices the [[ forms are rarely used, although they have some slight semantic differences from the [ form (e.g. it drops any names or dimnames attribute, and that partial matching is used for character indices). When indexing multi-dimensional structures with a single index, x[[i]] or x[i] will return the ith sequential element of x.

For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.

The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
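
A small sketch of the list behaviour quoted above, including the recursive case where the index to [[ is a vector. The list lst is a made-up example:

```r
lst <- list(a = 1:3, b = list(c = "x", d = 4))  # hypothetical example list

lst["a"]             # a list of length 1 containing a
lst[["a"]]           # the integer vector 1 2 3
lst[c("a", "b")]     # a list of length 2
lst[[c("b", "d")]]   # 4: applied in turn, same as lst[["b"]][["d"]]
```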

R gotcha: logical-and operator for combining conditions is & not &&

From the help page for Logical Operators, accessible by ?"&&":

& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.

(R version 2.13-0)

In other words, when using subset, use the single &.

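Here is an illustration of the difference (the output is from an older R; since R 4.3.0 the && call below raises an error because its operands have length greater than one):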

c(1,1,0,0) & c(1,0,1,0)
[1]  TRUE FALSE FALSE FALSE

c(1,1,0,0) && c(1,0,1,0)
[1] TRUE

If this looks quirky compared to other programming paradigms, remember that R needs to provide a vectorised form of the operator.
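
The "evaluation proceeds only until the result is determined" part of the help text can be seen with length-one operands. This is just a sketch; the stop() call is there to show that the right-hand side is never evaluated:

```r
x <- c(1, 1, 0, 0)
y <- c(1, 0, 1, 0)

# Elementwise: one result per position, suitable for subset() or [
x & y                          # TRUE FALSE FALSE FALSE

# Short-circuiting: the right-hand side is never evaluated here
FALSE && stop("not reached")   # FALSE

# Typical control-flow use with length-one conditions
if (length(x) == 4 && all(y %in% c(0, 1))) "ok" else "not ok"   # "ok"
```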

Difference between 'select' and '$' in R

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.

Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
dplyr returns a different (more complex) object.
Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/or maintain; and return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.

This might also help tease out (if one is wont to avoid looking at source code of packages) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:

library(dplyr)

microbenchmark::microbenchmark(
  base1 = mtcars$cyl, # returns a vector
  base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
  base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
  base3 = mtcars[,"cyl"], # returns a vector
  base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
  dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
##    expr     min       lq       mean   median        uq      max neval
##   base1   4.682   6.3860    9.23727   7.7125   10.6050   25.397   100
##   base2   4.224   5.9905    9.53136   7.7590   11.1095   27.329   100
##  base2a   3.710   5.5380    7.92479   7.0845   10.1045   16.026   100
##   base3   6.312  10.9935   13.99914  13.1740   16.2715   37.765   100
##   base4  51.084  70.3740   92.03134  76.7350   95.9365  662.395   100
##  dplyr1 698.954 742.9615  978.71306 784.8050 1154.6750 3568.188   100
##  dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388   100
##  dplyr3  64.299  78.3745  126.97205  85.3110  112.1000 2383.731   100
##  dplyr4  63.235  73.0450   99.28021  85.1080  114.8465  263.219   100

But what if we have a lot of columns?

# Make a wider version of mtcars
do.call(
  cbind.data.frame,
  lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols

# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
  base1 = mtcars_manycols$cyl_4, # returns a vector
  base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
  base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
  base3 = mtcars_manycols[,"cyl_4"], # returns a vector
  base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
  dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
##    expr      min        lq       mean    median        uq       max neval
##   base1    4.534    6.8535   12.15802    8.7865   13.1775    75.095   100
##   base2    4.150    6.5390   11.59937    9.3005   13.2220    73.332   100
##  base2a    3.904    5.9755   10.73095    7.5820   11.2715    61.687   100
##   base3    6.255   11.5270   16.42439   13.6385   18.6910    70.106   100
##   base4   66.175   89.8560  118.37694   99.6480  122.9650   340.653   100
##  dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705  9354.698   100
##  dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716   100
##  dplyr3  124.295  142.9535  216.89692  166.7115  209.1550  1138.368   100
##  dplyr4  127.280  150.0575  195.21398  169.5285  209.0480   488.199   100

For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.

NOTE: dplyr verbs are likely better candidates than subset(). And while I lazily use $, it's also a tad dangerous due to its default partial matching behaviour, as is [[ with exact = FALSE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
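
To see what partial matching looks like in practice, here is a short sketch. "cy" is deliberately an incomplete column name, and tryCatch is only used to capture the warning so it can be inspected:

```r
mtcars$cy                       # partial match: returns the cyl column
mtcars[["cy"]]                  # NULL: [[ is exact by default
mtcars[["cy", exact = FALSE]]   # partial match again

# With the option set, the partial match with $ produces a warning
old <- options(warnPartialMatchDollar = TRUE)
res <- tryCatch(mtcars$cy, warning = function(w) "warning raised")
options(old)                    # restore the previous setting
res                             # "warning raised"
```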
