Difference Between Pull and Select in Dplyr

Difference between pull and select in dplyr?

You could see select as an analogue of [ (or magrittr::extract), and pull as an analogue of [[ (or $) or magrittr::extract2, for data frames (the analogue of [[ for lists would be purrr::pluck).
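A plain list (no packages needed) makes the [ vs [[ analogy concrete:

```r
lst <- list(a = 1:3, b = letters[1:3])
lst["a"]    # `[` keeps the container: a one-element list (like select keeps a data frame)
lst[["a"]]  # `[[` extracts the element itself (like pull returns the bare vector)
```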

library(dplyr)
library(magrittr) # for extract and extract2 below

df <- iris %>% head

All of these give the same output:

df %>% pull(Sepal.Length)
df %>% pull("Sepal.Length")
a <- "Sepal.Length"; df %>% pull(!!quo(a))
df %>% extract2("Sepal.Length")
df %>% `[[`("Sepal.Length")
df[["Sepal.Length"]]

# all of them:
# [1] 5.1 4.9 4.7 4.6 5.0 5.4

And all of these give the same output:

df %>% select(Sepal.Length)
a <- "Sepal.Length"; df %>% select(!!quo(a))
df %>% select("Sepal.Length")
df %>% extract("Sepal.Length")
df %>% `[`("Sepal.Length")
df["Sepal.Length"]
# all of them:
# Sepal.Length
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4

pull and select accept bare (unquoted) column names as well as character or numeric indices, while the base and magrittr alternatives take character or numeric indices only.

One important difference is how they handle negative indices.

For select, a negative index means a column to drop.

For pull, it means counting from the last column.

df %>% pull(-Sepal.Length)
df %>% pull(-1)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

A strange-looking result, but Sepal.Length is converted to its position (1), so -Sepal.Length becomes -1, which pull reads as the last column: Species.

This feature is not supported by [[ or extract2:

df %>% `[[`(-1)
df %>% extract2(-1)
df[[-1]]
# Error in .subset2(x, i, exact = exact) :
# attempt to select more than one element in get1index <real>
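If you want the pull(-1) behaviour (count from the last column) in base R, one sketch is to index with ncol():

```r
df <- head(iris)
df[[ncol(df)]]  # last column as a vector, equivalent to pull(df, -1)
```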

Negative indices to drop columns are supported by [ and extract though.

df %>% select(-Sepal.Length)
df %>% select(-1)
df %>% `[`(-1)
df[-1]

# Sepal.Width Petal.Length Petal.Width Species
# 1 3.5 1.4 0.2 setosa
# 2 3.0 1.4 0.2 setosa
# 3 3.2 1.3 0.2 setosa
# 4 3.1 1.5 0.2 setosa
# 5 3.6 1.4 0.2 setosa
# 6 3.9 1.7 0.4 setosa

Difference between 'select' and '$' in R

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.

  • Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
  • dplyr returns a different (more complex) object.
  • Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial-matching behaviour); may be harder to read and/or maintain; and return a bare vector, which lacks some of the contextual richness of a data frame.
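The partial-matching risk mentioned above can be demonstrated with a toy data frame (the column names here are made up for illustration):

```r
df <- data.frame(value_a = 1:3)
df$value       # partial match: silently returns the value_a column
df[["value"]]  # exact match by default: returns NULL instead
```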

This might also help tease out (for those who would rather not read package source code) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the operations are "give me this column" operations, so read it with that context:

library(dplyr)

microbenchmark::microbenchmark(
  base1  = mtcars$cyl,                     # returns a vector
  base2  = mtcars[['cyl', exact = TRUE]],  # returns a vector
  base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
  base3  = mtcars[, "cyl"],                # returns a vector
  base4  = subset(mtcars, select = cyl),   # returns a 1-column data frame
  dplyr1 = dplyr::select(mtcars, cyl),     # returns a 1-column data frame
  dplyr2 = dplyr::select(mtcars, "cyl"),   # returns a 1-column data frame
  dplyr3 = dplyr::pull(mtcars, cyl),       # returns a vector
  dplyr4 = dplyr::pull(mtcars, "cyl")      # returns a vector
)
## Unit: microseconds
##    expr     min       lq       mean   median        uq      max neval
##   base1   4.682   6.3860    9.23727   7.7125   10.6050   25.397   100
##   base2   4.224   5.9905    9.53136   7.7590   11.1095   27.329   100
##  base2a   3.710   5.5380    7.92479   7.0845   10.1045   16.026   100
##   base3   6.312  10.9935   13.99914  13.1740   16.2715   37.765   100
##   base4  51.084  70.3740   92.03134  76.7350   95.9365  662.395   100
##  dplyr1 698.954 742.9615  978.71306 784.8050 1154.6750 3568.188   100
##  dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388   100
##  dplyr3  64.299  78.3745  126.97205  85.3110  112.1000 2383.731   100
##  dplyr4  63.235  73.0450   99.28021  85.1080  114.8465  263.219   100

But what if we have a lot of columns?

# Make a wider version of mtcars
mtcars_manycols <- do.call(
  cbind.data.frame,
  lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
)

# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
  base1  = mtcars_manycols$cyl_4,                     # returns a vector
  base2  = mtcars_manycols[['cyl_4', exact = TRUE]],  # returns a vector
  base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
  base3  = mtcars_manycols[, "cyl_4"],                # returns a vector
  base4  = subset(mtcars_manycols, select = cyl_4),   # returns a 1-column data frame
  dplyr1 = dplyr::select(mtcars_manycols, cyl_4),     # returns a 1-column data frame
  dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"),   # returns a 1-column data frame
  dplyr3 = dplyr::pull(mtcars_manycols, cyl_4),       # returns a vector
  dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4")      # returns a vector
)
## Unit: microseconds
##    expr      min        lq       mean    median        uq       max neval
##   base1    4.534    6.8535   12.15802    8.7865   13.1775    75.095   100
##   base2    4.150    6.5390   11.59937    9.3005   13.2220    73.332   100
##  base2a    3.904    5.9755   10.73095    7.5820   11.2715    61.687   100
##   base3    6.255   11.5270   16.42439   13.6385   18.6910    70.106   100
##   base4   66.175   89.8560  118.37694   99.6480  122.9650   340.653   100
##  dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705  9354.698   100
##  dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716   100
##  dplyr3  124.295  142.9535  216.89692  166.7115  209.1550  1138.368   100
##  dplyr4  127.280  150.0575  195.21398  169.5285  209.0480   488.199   100

For a ton of projects, dplyr is a great choice. Speed of execution is very often not a strength of the "tidyverse", but the speed of development and expressiveness usually outweigh the speed difference.

NOTE: dplyr verbs are likely better candidates than subset(). And while I lazily use $, it's also a tad dangerous due to its default partial-matching behaviour, as is [[ without exact = TRUE. A good habit (IMO) is to set options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
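A quick sketch of what that option does (toy column name; the exact warning text may vary by R version):

```r
options(warnPartialMatchDollar = TRUE)
df <- data.frame(value_a = 1:3)
# df$value now signals a warning instead of silently partial-matching:
msg <- tryCatch(df$value, warning = function(w) conditionMessage(w))
msg
```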

What's the difference between using select + unlist from dplyr package and using the dollar sign?

(Posting comments as community wiki.)

These are not quite equivalent: unlist(select(...)) keeps (probably unwanted) names.

dd <- data.frame(Col1=c("abc","def"))
str(unlist(select(dd,Col1)))
## Factor w/ 2 levels "abc","def": 1 2
## - attr(*, "names")= chr [1:2] "Col11" "Col12"
str(dd$Col1)
## Factor w/ 2 levels "abc","def": 1 2
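To drop those names without dplyr, unlist() takes use.names = FALSE (note that since R 4.0 Col1 would be character rather than a factor, unlike the output above):

```r
dd <- data.frame(Col1 = c("abc", "def"))
unlist(dd["Col1"])                     # named: Col11, Col12
unlist(dd["Col1"], use.names = FALSE)  # unnamed, same content as dd$Col1
```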

Your instructor is probably just a fan of the tidyverse (@RichScriven); pull(Dat, Col1) or (for extreme "tidiness") Dat %>% pull(Col1) would be more idiomatic (@Henrik). Dat$Col1 or Dat[["Col1"]] would be the base-R equivalents (the former is more convenient for interactive use; the latter is marginally safer for programming purposes since it won't do partial name matching).

It hardly matters, but the tidyverse approaches are much slower.

microbenchmark(dd$Col1, dd[["Col1"]], pull(dd, Col1), unlist(select(dd, Col1)))
Unit: microseconds
                     expr     min        lq       mean    median       uq      max neval cld
                  dd$Col1   5.296   10.9630   14.86871   13.4040   17.160   31.036   100   a
             dd[["Col1"]]   7.870    9.6535   15.18874   11.8270   16.635   88.842   100   a
           pull(dd, Col1)  44.160  108.7625  128.89342  117.8415  136.890  422.462   100   a
 unlist(select(dd, Col1)) 601.480 1132.8240 1436.44178 1214.4420 1378.141 8796.964   100   b

Pull items from a list and output to a dataframe in R

You can use map_dfc rather than map_df, as it will bind the columns.

library(tidyverse)

map_dfc(mylist, select, 2) %>%
head()

# cyl...1 cyl...2 cyl...3
#Mazda RX4 6 12 18
#Mazda RX4 Wag 6 12 18
#Datsun 710 4 8 12
#Hornet 4 Drive 6 12 18
#Hornet Sportabout 8 16 24
#Valiant 6 12 18

Also, if we want to assign a name (e.g., add a sequential number for each column), then we could use map2_dfc. You could also pass a different set of names.

map2_dfc(mylist,
         seq_along(mylist),
         \(x, y) x %>% select(2) %>% rename(!!paste0(names(.)[1], y) := 1)) %>%
  head()

# cyl1 cyl2 cyl3
#Mazda RX4 6 12 18
#Mazda RX4 Wag 6 12 18
#Datsun 710 4 8 12
#Hornet 4 Drive 6 12 18
#Hornet Sportabout 8 16 24
#Valiant 6 12 18
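The question's mylist isn't shown; assuming it is three scaled copies of mtcars (which reproduces the 6/12/18 pattern above), a base-R sketch of the same column-binding is:

```r
mylist <- lapply(1:3, function(i) mtcars * i)      # assumed stand-in for the question's mylist
out <- do.call(cbind, lapply(mylist, `[[`, 2))     # take column 2 of each data frame
colnames(out) <- paste0("cyl", seq_along(mylist))  # cyl1, cyl2, cyl3
head(out)
```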

conditional select from different sources in dplyr

I think left_join is more efficient and easier to understand in this case. df3 is the final output.

library(tidyverse)
library(lubridate)
set.seed(123)
df1 <- tibble(Date = seq(as.Date("1990/1/1"), as.Date("1999/12/31"), "days")) %>%
  mutate(Year = year(Date)) %>%
  mutate(DOY = yday(Date)) %>%
  group_by(Year) %>%
  mutate(x = cumsum(runif(n())))

df2 <- tibble(Year = seq(1990, 1999),
              DOY = c(101, 93, 94, 95, 88, 100, 102, 200, 301, 34))

df3 <- df2 %>%
  left_join(df1, by = c("Year", "DOY")) %>%
  select(-Date)

df3
# # A tibble: 10 x 3
# Year DOY x
# <dbl> <dbl> <dbl>
# 1 1990 101 50.5
# 2 1991 93 45.4
# 3 1992 94 44.8
# 4 1993 95 47.2
# 5 1994 88 45.7
# 6 1995 100 52.2
# 7 1996 102 49.8
# 8 1997 200 96.1
# 9 1998 301 148.
# 10 1999 34 14.1
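For comparison, base R's merge() does the same keyed lookup (a sketch with toy data, not the question's; note that merge() is an inner join by default, so add all.x = TRUE for a true left join):

```r
df1 <- data.frame(Year = rep(1990:1991, each = 2),
                  DOY  = c(1, 2, 1, 2),
                  x    = c(10, 20, 30, 40))
df2 <- data.frame(Year = c(1990, 1991), DOY = c(2, 1))
df3 <- merge(df2, df1, by = c("Year", "DOY"))  # keeps only matching (Year, DOY) pairs
df3
```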

dplyr::pull with a bare or quoted string as an argument?

pull calls select_var (superseded in current dplyr by tidyselect::vars_pull), which uses quasiquotation to decide whether or not to evaluate the argument, eventually returning a column name from the data. This allows the column to be specified in a flexible way that supports both interactive and programming use.

a <- "cyl"
select_var(names(mtcars), a)
# [1] "cyl"
pull(mtcars, a)
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

