Difference Between Pull and Select in Dplyr

Difference between pull and select in dplyr?

You could see select as an analogue of [ (or magrittr::extract), and pull as an analogue of [[ (or $) or magrittr::extract2, for data frames (the analogue of [[ for lists would be purrr::pluck).
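A plain list (no packages needed) makes the [ vs [[ analogy concrete:

```r
lst <- list(a = 1:3, b = letters[1:3])
lst["a"]    # `[` keeps the container: a one-element list (like select keeps a data frame)
lst[["a"]]  # `[[` extracts the element itself (like pull returns the bare vector)
```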

library(dplyr)
library(magrittr) # for extract and extract2 below

df <- iris %>% head

All of these give the same output:

df %>% pull(Sepal.Length)
df %>% pull("Sepal.Length")
a <- "Sepal.Length"; df %>% pull(!!quo(a))
df %>% extract2("Sepal.Length")
df %>% `[[`("Sepal.Length")
df[["Sepal.Length"]]

# all of them:
# [1] 5.1 4.9 4.7 4.6 5.0 5.4

And all of these give the same output:

df %>% select(Sepal.Length)
a <- "Sepal.Length"; df %>% select(!!quo(a))
df %>% select("Sepal.Length")
df %>% extract("Sepal.Length")
df %>% `[`("Sepal.Length")
df["Sepal.Length"]
# all of them:
# Sepal.Length
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4

pull and select accept bare (unquoted) column names as well as character or numeric indices, while the base and magrittr alternatives take character or numeric indices only.

One important difference is how they handle negative indices.

For select, a negative index means a column to drop.

For pull, it means counting from the last column.

df %>% pull(-Sepal.Length)
df %>% pull(-1)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

A strange-looking result, but Sepal.Length is converted to its position (1), so -Sepal.Length becomes -1, which pull reads as the last column: Species.

This feature is not supported by [[ or extract2:

df %>% `[[`(-1)
df %>% extract2(-1)
df[[-1]]
# Error in .subset2(x, i, exact = exact) :
# attempt to select more than one element in get1index <real>
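If you want the pull(-1) behaviour (count from the last column) in base R, one sketch is to index with ncol():

```r
df <- head(iris)
df[[ncol(df)]]  # last column as a vector, equivalent to pull(df, -1)
```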

Negative indices to drop columns are supported by [ and extract though.

df %>% select(-Sepal.Length)
df %>% select(-1)
df %>% `[`(-1)
df[-1]

# Sepal.Width Petal.Length Petal.Width Species
# 1 3.5 1.4 0.2 setosa
# 2 3.0 1.4 0.2 setosa
# 3 3.2 1.3 0.2 setosa
# 4 3.1 1.5 0.2 setosa
# 5 3.6 1.4 0.2 setosa
# 6 3.9 1.7 0.4 setosa

Difference between 'select' and '$' in R

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.

  • Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
  • dplyr returns a different (more complex) object.
  • Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial-matching behaviour); may be harder to read and/or maintain; and return a bare vector, which lacks some of the contextual richness of a data frame.
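The partial-matching risk mentioned above can be demonstrated with a toy data frame (the column names here are made up for illustration):

```r
df <- data.frame(value_a = 1:3)
df$value       # partial match: silently returns the value_a column
df[["value"]]  # exact match by default: returns NULL instead
```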

This might also help tease out (for those who would rather not read package source code) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the operations are "give me this column" operations, so read it with that context:

library(dplyr)

microbenchmark::microbenchmark(
  base1  = mtcars$cyl,                     # returns a vector
  base2  = mtcars[['cyl', exact = TRUE]],  # returns a vector
  base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
  base3  = mtcars[, "cyl"],                # returns a vector
  base4  = subset(mtcars, select = cyl),   # returns a 1-column data frame
  dplyr1 = dplyr::select(mtcars, cyl),     # returns a 1-column data frame
  dplyr2 = dplyr::select(mtcars, "cyl"),   # returns a 1-column data frame
  dplyr3 = dplyr::pull(mtcars, cyl),       # returns a vector
  dplyr4 = dplyr::pull(mtcars, "cyl")      # returns a vector
)
## Unit: microseconds
##    expr     min       lq       mean   median        uq      max neval
##   base1   4.682   6.3860    9.23727   7.7125   10.6050   25.397   100
##   base2   4.224   5.9905    9.53136   7.7590   11.1095   27.329   100
##  base2a   3.710   5.5380    7.92479   7.0845   10.1045   16.026   100
##   base3   6.312  10.9935   13.99914  13.1740   16.2715   37.765   100
##   base4  51.084  70.3740   92.03134  76.7350   95.9365  662.395   100
##  dplyr1 698.954 742.9615  978.71306 784.8050 1154.6750 3568.188   100
##  dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388   100
##  dplyr3  64.299  78.3745  126.97205  85.3110  112.1000 2383.731   100
##  dplyr4  63.235  73.0450   99.28021  85.1080  114.8465  263.219   100

But what if we have a lot of columns?

# Make a wider version of mtcars
mtcars_manycols <- do.call(
  cbind.data.frame,
  lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
)

# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
  base1  = mtcars_manycols$cyl_4,                     # returns a vector
  base2  = mtcars_manycols[['cyl_4', exact = TRUE]],  # returns a vector
  base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
  base3  = mtcars_manycols[, "cyl_4"],                # returns a vector
  base4  = subset(mtcars_manycols, select = cyl_4),   # returns a 1-column data frame
  dplyr1 = dplyr::select(mtcars_manycols, cyl_4),     # returns a 1-column data frame
  dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"),   # returns a 1-column data frame
  dplyr3 = dplyr::pull(mtcars_manycols, cyl_4),       # returns a vector
  dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4")      # returns a vector
)
## Unit: microseconds
##    expr      min        lq       mean    median        uq       max neval
##   base1    4.534    6.8535   12.15802    8.7865   13.1775    75.095   100
##   base2    4.150    6.5390   11.59937    9.3005   13.2220    73.332   100
##  base2a    3.904    5.9755   10.73095    7.5820   11.2715    61.687   100
##   base3    6.255   11.5270   16.42439   13.6385   18.6910    70.106   100
##   base4   66.175   89.8560  118.37694   99.6480  122.9650   340.653   100
##  dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705  9354.698   100
##  dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716   100
##  dplyr3  124.295  142.9535  216.89692  166.7115  209.1550  1138.368   100
##  dplyr4  127.280  150.0575  195.21398  169.5285  209.0480   488.199   100

For a ton of projects, dplyr is a great choice. Speed of execution is very often not a strength of the "tidyverse", but the speed of development and expressiveness usually outweigh the speed difference.

NOTE: dplyr verbs are likely better candidates than subset(). And while I lazily use $, it's also a tad dangerous due to its default partial-matching behaviour, as is [[ without exact = TRUE. A good habit (IMO) is to set options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
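A quick sketch of what that option does (toy column name; the exact warning text may vary by R version):

```r
options(warnPartialMatchDollar = TRUE)
df <- data.frame(value_a = 1:3)
# df$value now signals a warning instead of silently partial-matching:
msg <- tryCatch(df$value, warning = function(w) conditionMessage(w))
msg
```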

What's the difference between using select + unlist from dplyr package and using the dollar sign?

(Posting comments as community wiki.)

These are not quite equivalent: unlist(select(...)) keeps (probably unwanted) names.

dd <- data.frame(Col1=c("abc","def"))
str(unlist(select(dd,Col1)))
## Factor w/ 2 levels "abc","def": 1 2
## - attr(*, "names")= chr [1:2] "Col11" "Col12"
str(dd$Col1)
## Factor w/ 2 levels "abc","def": 1 2
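To drop those names without dplyr, unlist() takes use.names = FALSE (note that since R 4.0 Col1 would be character rather than a factor, unlike the output above):

```r
dd <- data.frame(Col1 = c("abc", "def"))
unlist(dd["Col1"])                     # named: Col11, Col12
unlist(dd["Col1"], use.names = FALSE)  # unnamed, same content as dd$Col1
```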

Your instructor is probably just a fan of the tidyverse (@RichScriven); pull(Dat, Col1) or (for extreme "tidiness") Dat %>% pull(Col1) would be more idiomatic (@Henrik). Dat$Col1 or Dat[["Col1"]] would be the base-R equivalents (the former is more convenient for interactive use; the latter is marginally safer for programming purposes since it won't do partial name matching).

It hardly matters, but the tidyverse approaches are much slower.

microbenchmark(dd$Col1, dd[["Col1"]], pull(dd, Col1), unlist(select(dd, Col1)))
Unit: microseconds
                     expr     min        lq       mean    median       uq      max neval cld
                  dd$Col1   5.296   10.9630   14.86871   13.4040   17.160   31.036   100   a
             dd[["Col1"]]   7.870    9.6535   15.18874   11.8270   16.635   88.842   100   a
           pull(dd, Col1)  44.160  108.7625  128.89342  117.8415  136.890  422.462   100   a
 unlist(select(dd, Col1)) 601.480 1132.8240 1436.44178 1214.4420 1378.141 8796.964   100   b

Pull items from a list and output to a dataframe in R

You can use map_dfc rather than map_df, as it will bind the columns.

library(tidyverse)

map_dfc(mylist, select, 2) %>%
head()

# cyl...1 cyl...2 cyl...3
#Mazda RX4 6 12 18
#Mazda RX4 Wag 6 12 18
#Datsun 710 4 8 12
#Hornet 4 Drive 6 12 18
#Hornet Sportabout 8 16 24
#Valiant 6 12 18

Also, if we want to assign a name (e.g., add a sequential number for each column), then we could use map2_dfc. You could also pass a different set of names.

map2_dfc(mylist,
         seq_along(mylist),
         \(x, y) x %>% select(2) %>% rename(!!paste0(names(.)[1], y) := 1)) %>%
  head()

# cyl1 cyl2 cyl3
#Mazda RX4 6 12 18
#Mazda RX4 Wag 6 12 18
#Datsun 710 4 8 12
#Hornet 4 Drive 6 12 18
#Hornet Sportabout 8 16 24
#Valiant 6 12 18
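The question's mylist isn't shown; assuming it is three scaled copies of mtcars (which reproduces the 6/12/18 pattern above), a base-R sketch of the same column-binding is:

```r
mylist <- lapply(1:3, function(i) mtcars * i)      # assumed stand-in for the question's mylist
out <- do.call(cbind, lapply(mylist, `[[`, 2))     # take column 2 of each data frame
colnames(out) <- paste0("cyl", seq_along(mylist))  # cyl1, cyl2, cyl3
head(out)
```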

conditional select from different sources in dplyr

I think left_join is more efficient and easier to understand in this case. df3 is the final output.

library(tidyverse)
library(lubridate)
set.seed(123)
df1 <- tibble(Date = seq(as.Date("1990/1/1"), as.Date("1999/12/31"), "days")) %>%
  mutate(Year = year(Date)) %>%
  mutate(DOY = yday(Date)) %>%
  group_by(Year) %>%
  mutate(x = cumsum(runif(n())))

df2 <- tibble(Year = seq(1990, 1999),
              DOY = c(101, 93, 94, 95, 88, 100, 102, 200, 301, 34))

df3 <- df2 %>%
  left_join(df1, by = c("Year", "DOY")) %>%
  select(-Date)

df3
# # A tibble: 10 x 3
# Year DOY x
# <dbl> <dbl> <dbl>
# 1 1990 101 50.5
# 2 1991 93 45.4
# 3 1992 94 44.8
# 4 1993 95 47.2
# 5 1994 88 45.7
# 6 1995 100 52.2
# 7 1996 102 49.8
# 8 1997 200 96.1
# 9 1998 301 148.
# 10 1999 34 14.1
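For comparison, base R's merge() does the same keyed lookup (a sketch with toy data, not the question's; note that merge() is an inner join by default, so add all.x = TRUE for a true left join):

```r
df1 <- data.frame(Year = rep(1990:1991, each = 2),
                  DOY  = c(1, 2, 1, 2),
                  x    = c(10, 20, 30, 40))
df2 <- data.frame(Year = c(1990, 1991), DOY = c(2, 1))
df3 <- merge(df2, df1, by = c("Year", "DOY"))  # keeps only matching (Year, DOY) pairs
df3
```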

dplyr::pull with a bare or quoted string as an argument?

pull calls select_var (superseded in current dplyr by tidyselect::vars_pull), which uses quasiquotation to decide whether or not to evaluate the argument, eventually returning a column name from the data. This allows the column to be specified in a flexible way that supports both interactive and programming use.

a <- "cyl"
select_var(names(mtcars), a)
# [1] "cyl"
pull(mtcars, a)
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

