Difference Between [] and $ Operators for Subsetting

Below we use a one-row data frame to keep the output brief:

mtcars1 <- mtcars[1, ]

Note the differences among these. We can use class as in class(mtcars["hp"]) to investigate the class of the return value.

The first two examples correspond to the code in the question and return a data frame and a plain vector, respectively. The key differences between [ and $ are that [ (1) can select multiple columns, (2) accepts a variable as the index and (3) returns a data frame (with exceptions shown in the later examples), whereas $ (1) can select only a single column, (2) requires the index to be hard-coded and (3) returns a vector.

mtcars1["hp"]  # returns data frame
##            hp
## Mazda RX4 110

mtcars1$hp # returns plain vector
## [1] 110
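
As a quick check of the two return types, here is a minimal sketch using class() and is.data.frame(). It rebuilds the same one-row mtcars1 so that it can be run on its own.

```r
# One-row data frame, as in the examples above
mtcars1 <- mtcars[1, ]

class(mtcars1["hp"])   # "data.frame"
class(mtcars1$hp)      # "numeric"

# is.data.frame() gives the same answer as a logical value
is.data.frame(mtcars1["hp"])   # TRUE
is.data.frame(mtcars1$hp)      # FALSE
```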

Here are other examples where the index is a single element. Note that the first and second examples below are equivalent, because drop = TRUE is the default here.

mtcars1[, "hp"] # returns plain vector
## [1] 110

mtcars1[, "hp", drop = TRUE] # returns plain vector
## [1] 110

mtcars1[, "hp", drop = FALSE] # returns data frame
##            hp
## Mazda RX4 110

Also there is the [[ operator which is like the $ operator except it can accept a variable as the index whereas $ requires the index to be hard coded:

mtcars1[["hp"]] # returns plain vector
## [1] 110

Here are others where the index specifies multiple elements. $ and [[ cannot be used with multiple elements, so these examples only use [:

mtcars1[c("mpg", "hp")] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp")] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp"), drop = FALSE] # returns data frame
##           mpg  hp
## Mazda RX4  21 110

mtcars1[, c("mpg", "hp"), drop = TRUE] # returns list
## $mpg
## [1] 21
##
## $hp
## [1] 110

[

mtcars[foo] can return more than one column if foo is a vector with more than one element, e.g. mtcars[c("hp", "mpg")], and in all cases the return value is a data.frame even if foo has only one element (as it does in the question).

There is also mtcars[, foo, drop = FALSE], which returns the same value as mtcars[foo], so it always returns a data frame. With drop = TRUE it returns the column itself if foo specifies a single column. If foo specifies multiple columns, drop = TRUE returns a list rather than a data.frame only when the result has a single row (as with the one-row mtcars1 above); otherwise it still returns a data frame.
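
The effect of drop can be checked directly. The following is a minimal sketch; cols is a hypothetical stand-in for foo, and the comments show the classes one would expect on a standard R installation.

```r
cols <- c("mpg", "hp")   # hypothetical stand-in for `foo`

# Many rows, several columns: there is nothing to drop
class(mtcars[, cols, drop = TRUE])    # "data.frame"

# One row, several columns: drop = TRUE gives a plain list
class(mtcars[1, cols, drop = TRUE])   # "list"

# Single column: drop = TRUE returns the column itself
class(mtcars[, "hp", drop = TRUE])    # "numeric"

# drop = FALSE gives the same value as list-style indexing
same_as_list_style <- isTRUE(all.equal(mtcars[, cols, drop = FALSE], mtcars[cols]))
```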

[[

On the other hand mtcars[[foo]] only works if foo has one element and it returns that column, not a data frame.

$

mtcars$hp also only works for a single column, like [[, and returns the column, not a data frame containing that column.

mtcars$hp is like mtcars[["hp"]]; however, there is no possibility to pass a variable index with $. One can only hard-code the index with $.
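
To illustrate the point about variable indexes, here is a small sketch. The variable name col is hypothetical; it simply holds a column name as a string.

```r
col <- "hp"   # hypothetical variable holding a column name

# [[ evaluates `col` and looks up the column it names
via_double_bracket <- mtcars[[col]]
identical(via_double_bracket, mtcars$hp)   # TRUE

# $ does not evaluate `col`; it looks for a column literally named "col"
via_dollar <- mtcars$col
is.null(via_dollar)                        # TRUE
```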

subset

Note that this works:

subset(mtcars, hp > 150)

returning a data frame containing those rows where the hp column exceeds 150:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
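
For comparison, the same rows can be selected with [ and a logical index. This is only a sketch of the rough equivalence; one difference worth keeping in mind is that subset() drops rows where the condition is NA, whereas the bracket form would return NA rows for them (mtcars has no missing values, so it does not matter here).

```r
by_subset  <- subset(mtcars, hp > 150)
by_bracket <- mtcars[mtcars$hp > 150, ]

# Both approaches select the same 13 rows
all.equal(by_subset, by_bracket)   # TRUE
nrow(by_subset)                    # 13
```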

other objects

The above pertain to data frames but other objects that can use $, [ and [[ will have their own rules. In particular if m is a matrix, e.g. m <- as.matrix(BOD), then m[, 1] is a vector, not a one column matrix, but m[, 1, drop = FALSE] is a one column matrix. m[[1]] and m[1] are both the first element of m, not the first column. m$a does not work at all.
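
The matrix behaviour described above can be sketched as follows. BOD is a built-in data frame with columns Time and demand, so m$Time is used here in place of the generic m$a.

```r
m <- as.matrix(BOD)   # 6 x 2 numeric matrix

is.matrix(m[, 1])                 # FALSE: a plain vector
dim(m[, 1, drop = FALSE])         # 6 1: still a one-column matrix

# Single-index forms pick the first element, not the first column
first_elements_match <- identical(m[[1]], m[1])   # TRUE

# $ is not defined for atomic vectors, so this raises an error
dollar_fails <- inherits(try(m$Time, silent = TRUE), "try-error")   # TRUE
```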

help

See ?Extract for more information. Also ?"$", ?"[" and ?"[[" all get to the same page, as well.

Difference between subset and filter from dplyr

They are, indeed, producing the same result, and they are very similar in concept.

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (roughly five times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 200 microseconds (median). And at 153,000 records, filter is about 2.5 times faster (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  df1 <- subset(airquality, Temp > 80 & Month > 5),
  df2 <- filter(airquality, Temp > 80 & Month > 5)
)

Unit: microseconds
   expr     min       lq     mean   median      uq      max neval cld
 subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100   a
 filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)

Unit: microseconds
   expr      min        lq     mean   median       uq      max neval cld
 subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
 filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100   a

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows

microbenchmark(
  df1 <- subset(air, Temp > 80 & Month > 5),
  df2 <- filter(air, Temp > 80 & Month > 5)
)

Unit: milliseconds
   expr       min        lq     mean    median        uq      max neval cld
 subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
 filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100   a
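
Beyond timing, one can check that the two calls select the same rows. The sketch below is guarded with requireNamespace() so that it still runs where dplyr is not installed. The row-name comparison at the end is an assumption about typical dplyr behaviour (filter renumbering rows while subset keeps the original row numbers), so no output is shown for it.

```r
base_result <- subset(airquality, Temp > 80 & Month > 5)

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::filter(airquality, Temp > 80 & Month > 5)

  # Same rows selected, in the same order
  stopifnot(identical(base_result$Temp, dplyr_result$Temp))
  stopifnot(identical(base_result$Day, dplyr_result$Day))

  # Inspect the row names of each result to see how they are numbered
  print(head(rownames(base_result)))
  print(head(rownames(dplyr_result)))
}
```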

Using multiple criteria in subset function and logical operators

The correct operator is %in% here. Here is an example with dummy data:

set.seed(1)
dat <- data.frame(bf11 = sample(4, 10, replace = TRUE),
                  foo = runif(10))

giving:

> head(dat)
  bf11       foo
1    2 0.2059746
2    2 0.1765568
3    3 0.6870228
4    4 0.3841037
5    1 0.7698414
6    4 0.4976992

The subset of dat where bf11 equals any of the set 1,2,3 is taken as follows using %in%:

> subset(dat, subset = bf11 %in% c(1,2,3))
   bf11       foo
1     2 0.2059746
2     2 0.1765568
3     3 0.6870228
5     1 0.7698414
8     3 0.9919061
9     3 0.3800352
10    1 0.7774452

As to why your original didn't work, break it down to see the problem. Look at what 1||2||3 evaluates to:

> 1 || 2 || 3
[1] TRUE

and you'd get the same using | instead. As a result, the subset() call would only return rows where bf11 was TRUE (or something that evaluated to TRUE).

What you could have written would have been something like:

subset(dat, subset = bf11 == 1 | bf11 == 2 | bf11 == 3)

This gives the same result as my earlier subset() call. The point is that you need a series of single comparisons, not a comparison of a series of options. But as you can see, %in% is far more useful and less verbose in such circumstances. Notice also that I have to use | because I want to compare each element of bf11 against 1, 2 and 3 in turn, whereas || looks only at the first element of each side (and, as of R 4.3.0, raises an error when an operand is longer than one). Compare:

> with(dat, bf11 == 1 || bf11 == 2)
[1] TRUE
> with(dat, bf11 == 1 | bf11 == 2)
[1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
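
The equivalence between %in% and a chain of == comparisons joined with | can be checked directly. In this sketch the bf11 values are typed in by hand to match the output shown above, so the result does not depend on the random number generator. The last two lines show one difference: the two forms treat NA differently.

```r
# Values copied from the example output above
bf11 <- c(2, 2, 3, 4, 1, 4, 4, 3, 3, 1)

by_in <- bf11 %in% c(1, 2, 3)
by_or <- bf11 == 1 | bf11 == 2 | bf11 == 3

identical(by_in, by_or)   # TRUE

# With missing values the two forms differ
NA %in% c(1, 2, 3)        # FALSE
NA == 1 | NA == 2         # NA
```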

The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe

The R Language Definition is handy for answering these types of questions:

  • http://cran.r-project.org/doc/manuals/R-lang.html#Indexing


R has three basic indexing operators, with syntax displayed by the following examples



x[i]
x[i, j]
x[[i]]
x[[i, j]]
x$a
x$"a"


For vectors and matrices the [[ forms are rarely used, although they have some slight semantic differences from the [ form (e.g. it drops any names or dimnames attribute, and that partial matching is used for character indices). When indexing multi-dimensional structures with a single index, x[[i]] or x[i] will return the ith sequential element of x.


For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.


The [[ form allows only a single element to be selected using integer or character indices, whereas [ allows indexing by vectors. Note though that for a list, the index can be a vector and each element of the vector is applied in turn to the list, the selected component, the selected component of that component, and so on. The result is still a single element.
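
A short sketch of these rules on a list follows. The list x and its element names are hypothetical and chosen only for illustration.

```r
x <- list(a = 1:3, b = list(c = "x", d = 4))   # hypothetical nested list

class(x["a"])            # "list": [ returns a list of the selected elements
class(x[["a"]])          # "integer": [[ returns the element itself
length(x[c("a", "b")])   # 2: [ accepts a vector of indices

# A vector index with [[ is applied recursively, one level at a time
x[[c("b", "d")]]         # 4, the same as x[["b"]][["d"]]
x[[c(2, 1)]]             # "x", the same as x[[2]][[1]]
```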

R gotcha: logical-and operator for combining conditions is & not &&

From the help page for Logical Operators, accessible by ?"&&":

& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.

(R version 2.13-0)

In other words, when using subset, use the single &.


Here is an illustration of the difference:

c(1,1,0,0) & c(1,0,1,0)
[1] TRUE FALSE FALSE FALSE

c(1,1,0,0) && c(1,0,1,0)
[1] TRUE

If this looks quirky compared to other programming paradigms, remember that R needs to provide a vectorised form of the operator.
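
The following sketch contrasts the two forms in a way that also runs on current R. Note that since R 4.3.0, passing a vector longer than one to && or || is an error rather than silently using the first element, so the scalar example below picks out single elements explicitly. The vectors a and b are hypothetical.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)   # hypothetical example vectors
b <- c(TRUE, FALSE, TRUE, FALSE)

# Elementwise AND: one result per position, as needed inside subset()
elementwise <- a & b               # TRUE FALSE FALSE FALSE

# Scalar AND on explicitly chosen single elements
first_only <- a[1] && b[1]         # TRUE

# && short-circuits: the right-hand side is never evaluated here
short_circuit <- FALSE && stop("not reached")   # FALSE

# To reduce a whole vector to one value for an if(), use all() or any()
all_true <- all(a & b)             # FALSE
```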

Difference between 'select' and '$' in R

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.

  • Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
  • dplyr returns a different (more complex) object.
  • Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour), may be harder to read and/or maintain, and return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.

This might also help tease out (if one is wont to avoid looking at the source code of packages) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:

library(dplyr)

microbenchmark::microbenchmark(
  base1 = mtcars$cyl, # returns a vector
  base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
  base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
  base3 = mtcars[,"cyl"], # returns a vector
  base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
  dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
##   expr     min       lq       mean   median        uq      max neval
##  base1   4.682   6.3860    9.23727   7.7125   10.6050   25.397   100
##  base2   4.224   5.9905    9.53136   7.7590   11.1095   27.329   100
## base2a   3.710   5.5380    7.92479   7.0845   10.1045   16.026   100
##  base3   6.312  10.9935   13.99914  13.1740   16.2715   37.765   100
##  base4  51.084  70.3740   92.03134  76.7350   95.9365  662.395   100
## dplyr1 698.954 742.9615  978.71306 784.8050 1154.6750 3568.188   100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388   100
## dplyr3  64.299  78.3745  126.97205  85.3110  112.1000 2383.731   100
## dplyr4  63.235  73.0450   99.28021  85.1080  114.8465  263.219   100

But what if we have a lot of columns?

# Make a wider version of mtcars
mtcars_manycols <- do.call(
  cbind.data.frame,
  lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
)

# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
  base1 = mtcars_manycols$cyl_4, # returns a vector
  base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
  base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
  base3 = mtcars_manycols[,"cyl_4"], # returns a vector
  base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
  dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
##   expr      min        lq       mean    median        uq       max neval
##  base1    4.534    6.8535   12.15802    8.7865   13.1775    75.095   100
##  base2    4.150    6.5390   11.59937    9.3005   13.2220    73.332   100
## base2a    3.904    5.9755   10.73095    7.5820   11.2715    61.687   100
##  base3    6.255   11.5270   16.42439   13.6385   18.6910    70.106   100
##  base4   66.175   89.8560  118.37694   99.6480  122.9650   340.653   100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705  9354.698   100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716   100
## dplyr3  124.295  142.9535  216.89692  166.7115  209.1550  1138.368   100
## dplyr4  127.280  150.0575  195.21398  169.5285  209.0480   488.199   100

For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.

NOTE: dplyr verbs are likely better candidates than subset(), and while I lazily use $, it's also a tad dangerous due to its default partial matching behaviour, as is [[ with exact = FALSE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
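
To make the partial matching risk concrete, here is a minimal sketch. The abbreviation "cy" is a hypothetical typo that happens to match only the cyl column of mtcars. suppressWarnings() is used so the sketch behaves the same whether or not warnPartialMatchDollar is set.

```r
# $ silently completes "cy" to "cyl"
via_dollar <- suppressWarnings(mtcars$cy)
identical(via_dollar, mtcars$cyl)             # TRUE

# [[ uses exact matching by default, so the typo is caught as NULL
is.null(mtcars[["cy"]])                       # TRUE

# [[ only matches partially when asked to
via_inexact <- mtcars[["cy", exact = FALSE]]
identical(via_inexact, mtcars$cyl)            # TRUE
```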


