Meaning of Tilde and Dot Notation in Dplyr

Tilde Dot in R (~.)

As MrFlick pointed out, these are two separate operators. Together, they provide a special mechanism that allows tidyverse packages to construct lambda functions on the fly. This is best described in ?purrr::as_mapper. Specifically,

If a formula, e.g. ~ .x + 2, it is converted to a function. There are three ways to refer to the arguments:

  • For a single argument function, use .

  • For a two argument function, use .x and .y

  • For more arguments, use ..1, ..2, ..3 etc

Using your example:

purrr::as_mapper( ~. > 5 )
# <lambda>
# function (..., .x = ..1, .y = ..2, . = ..1)
# . > 5
# attr(,"class")
# [1] "rlang_lambda_function"

creates a function that returns a logical value indicating whether the argument of the function is greater than 5. purrr::detect() generates this function internally and then uses it to traverse the input vector x. The final result is the first element of x that satisfies the "greater than 5" constraint.

As pointed out by Konrad, this mechanism is specific to tidyverse and does not work in general. Outside of tidyverse, the behavior of this syntax is explained in a related question.

Meaning of ~. (tilde dot) argument?

This is a formula, in a shorthand notation. Try this:

plot( mpg ~ cyl, data= mtcars )

The left hand is the dependent variable, the right hand is the independent variable. Much like y = bx + c means that y ~ x.

Formulas are one of the corner stones of R, and you will need to understand them to use R efficiently. Most frequently, formulas are used in modeling of all sorts, for example you can do basic linear regression with

lm( mpg ~ wt, data= mtcars )

...to see how mileage per gallon depend on weight. Take a look at ?formula for some more explanations.

The dot means "any columns from data that are otherwise not used". Google for "R formulas" to get more information.

Use of Tilde (~) and period (.) in R

This overall is known as tidyverse non-standard evaluation (NSE). You probably found out that ~ also is used in formulas to indicate that the left hand side is dependent on the right hand side.

In tidyverse NSE, ~ indicates function(...). Thus, these two expressions are equivalent.

x %>% detect(function(...) ..1 > 5)
#[1] 6

x %>% detect(~.x > 5)
#[1] 6

~ automatically assigns each argument of the function to the .; .x, .y; and ..1, ..2 ..3 special symbols. Note that only the first argument becomes ..

map2(1, 2, function(x,y) x + y)
#[[1]]
#[1] 3

map2(1, 2, ~.x + .y)
#[[1]]
#[1] 3

map2(1, 2, ~..1 + ..2)
#[[1]]
#[1] 3

map2(1, 2, ~. + ..2)
#[[1]]
#[1] 3

map2(1, 2, ~. + .[2])
#[[1]]
#[1] NA

This automatic assignment can be very helpful when there are many variables.

mtcars %>% pmap_dbl(~ ..1/..4)
# [1] 0.19090909 0.19090909 0.24516129 0.19454545 0.10685714 0.17238095 0.05836735 0.39354839 0.24000000 0.15609756
#[11] 0.14471545 0.09111111 0.09611111 0.08444444 0.05073171 0.04837209 0.06391304 0.49090909 0.58461538 0.52153846
#[21] 0.22164948 0.10333333 0.10133333 0.05428571 0.10971429 0.41363636 0.28571429 0.26902655 0.05984848 0.11257143
#[31] 0.04477612 0.19633028

But in addition to all of the special symbols I noted above, the arguments are also assigned to .... Just like all of R, ... is sort of like a named list of arguments, so you can use it along with with:

mtcars %>% pmap_dbl(~ with(list(...), mpg/hp))
# [1] 0.19090909 0.19090909 0.24516129 0.19454545 0.10685714 0.17238095 0.05836735 0.39354839 0.24000000 0.15609756
#[11] 0.14471545 0.09111111 0.09611111 0.08444444 0.05073171 0.04837209 0.06391304 0.49090909 0.58461538 0.52153846
#[21] 0.22164948 0.10333333 0.10133333 0.05428571 0.10971429 0.41363636 0.28571429 0.26902655 0.05984848 0.11257143
#[31] 0.04477612 0.19633028

An other way to think about why this works is because data.frames are just a list with some row names:

a <- list(a = c(1,2), b = c("A","B"))
a
#$a
#[1] 1 2
#$b
#[1] "A" "B"
attr(a,"row.names") <- as.character(c(1,2))
class(a) <- "data.frame"
a
# a b
#1 1 A
#2 2 B

What is meaning of first tilde in purrr::map

As per the map help documentation, map needs a function but it also accepts a formula, character vector, numeric vector, or list, the latter of which are converted to functions.

The ~ operator in R creates formula. So ~ lm(mpg ~ wt, data = .) is a formula. Formulas are useful in R because they prevent immediate evaluation of symbols. For example you can define

x <- ~f(a+b)

without f, a or b being defined anywhere. In this case ~ lm(mpg ~ wt, data = .) is basically a shortcut for function(x) {lm(mpg ~ wt, data = x)} because map can change the value of . in the formula as needed.

Without the tilde, lm(mpg ~ wt, data = .) is just an expression or call in R that's evaluated immediately. The . wouldn't be defined at the time that's called and map can't convert that into a function.

You can turn these formulas into functions outside of the map() with purrr::as_mapper() function. For example

myfun <- as_mapper(~lm(mpg ~ wt, data = .))
myfun(mtcars)
# Call:
# lm(formula = mpg ~ wt, data = .)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344

myfun
# <lambda>
# function (..., .x = ..1, .y = ..2, . = ..1)
# lm(mpg ~ wt, data = .)
# attr(,"class")
# [1] "rlang_lambda_function"

You can see how the . becomes the first parameter that's passed to that function.

What does the dplyr period character . reference?

The dot is used within dplyr mainly (not exclusively) in mutate_each, summarise_each and do. In the first two (and their SE counterparts) it refers to all the columns to which the functions in funs are applied. In do it refers to the (potentially grouped) data.frame so you can reference single columns by using .$xyz to reference a column named "xyz".

The reasons you cannot run

filter(df, . == 5)

is because a) filter is not designed to work with multiple columns like mutate_each for example and b) you would need to use the pipe operator %>% (originally from magrittr).

However, you could use it with a function like rowSums inside filter when combined with the pipe operator %>%:

> filter(mtcars, rowSums(. > 5) > 4)
Error: Objekt '.' not found

> mtcars %>% filter(rowSums(. > 5) > 4) %>% head()
lm cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4

You should also take a look at the magrittr help files:

library(magrittr)
help("%>%")

From the help page:

Placing lhs elsewhere in rhs call
Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).

Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design
the behavior is slightly different when using it inside nested
function calls. In particular, if the placeholder is only used in a
nested function call, lhs will also be placed as the first argument!
The reason for this is that in most use-cases this produces the most
readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is
equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly
more compact. It is possible to overrule this behavior by enclosing
the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is
equivalent to c(min(1:10), max(1:10)).

Use of ~ (tilde) in R programming Language

The thing on the right of <- is a formula object. It is often used to denote a statistical model, where the thing on the left of the ~ is the response and the things on the right of the ~ are the explanatory variables. So in English you'd say something like "Species depends on Sepal Length, Sepal Width, Petal Length and Petal Width".

The myFormula <- part of that line stores the formula in an object called myFormula so you can use it in other parts of your R code.


Other common uses of formula objects in R

The lattice package uses them to specify the variables to plot.

The ggplot2 package uses them to specify panels for plotting.

The dplyr package uses them for non-standard evaulation.

In map(), when is it necessary to use a tilde and a period. (~ and .)

The quick answer to your question is, it is never necessary to use the tilde notation when calling map. There are different ways of calling map and the tilde notation is one of them. You already described the simpelst way of calling map, when a function only takes/needs one argument.

df %>% map_dbl(mean)

However, when functions get more complex there are basically two ways to call them either with the tilde notation or with a normal anonymous function.

# normal anonymous function
models <- mtcars %>%
split(.$cyl) %>%
map(function(x) lm(mpg ~ wt, data = x))

# anonymous mapper function (~)
models <- mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .))

The tilde notation is basically turning a formula into a function, which is most times easier to read. Each option can be turned into a named function, which works as follows. Ideally, the named function reduces the complexity of the underlying function to one argument (the one which should be looped over) and in this case the function can be called like all simple functions in map without further arguments/notations.

# normal named function notation 
lm_mpg_wt <- function(x) {
lm(mpg ~ wt, data = x)
}

models <- mtcars %>%
split(.$cyl) %>%
map(lm_mpg_wt)

# named mapper function
mapper_lm_mpg_wt <- as_mapper(~ lm(mpg ~ wt, data = .))

models <- mtcars %>%
split(.$cyl) %>%
map(mapper_lm_mpg_wt)

Basically these are your options. You should choose whatever is easiest and most fit to your problem. Named functions are best, if you need them again. Many think that mapper functions are easier to read, but at the end of the day that is a choice of personal preference.

R combinations with dot (.), ~, and pipe (%%) operator

That line uses the . in three different ways.

         [1]             [2]      [3]
aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2))

Generally speaking you pass in the value from the pipe into your function at a specific location with . but there are some exceptions. One exception is when the . is in a formula. The ~ is used to create formulas in R. The pipe wont change the meaning of the formula, so it behaves like it would without any escaping. For example

aggregate(. ~ cyl, data=mydata)

And that's just because aggregate requires a formula with both a left and right hand side. So the . at [1] just means "all the other columns in the dataset." This use is not at all related to magrittr.

The . at [2] is the value that's being passed in as the pipe. If you have a plain . as a parameter to the function, that's there the value will be placed. So the result of the subset() will go to the data= parameter.

The magrittr library also allows you to define anonymous functions with the . variable. If you have a chain that starts with a ., it's treated like a function. so

. %>% mean %>% round(2)

is the same as

function(x) round(mean(x), 2)

so you're just creating a custom function with the . at [3]



Related Topics



Leave a reply



Submit