dplyr piping data - difference between `.` and `.x`

The . is the basic unit of transfer for magrittr pipelines (dplyr re-exports the magrittr pipe). It contains the value coming from the pipe.

The .x value is something that the tidyverse added. It's used when you create anonymous functions with the ~ (tilde) syntax, which calls rlang::as_function to turn the formula into a function. It's basically a shortcut: rather than typing out function(x) x+5, you can just write ~.x+5. Since functions can have more than one parameter, it helps to have names for them, so .x refers to the first parameter (and .y to the second). as_function also lets you use . as an alias for the first parameter. It can do this because the ~ creates a formula, and magrittr generally doesn't replace . inside formulas, so the mapper is free to reinterpret the . there. You can see the function signature here:

f <- rlang::as_function(~.x+5)
f
# <lambda>
# function (..., .x = ..1, .y = ..2, . = ..1)
# .x + 5
# attr(,"class")
# [1] "rlang_lambda_function"

You can see how both . and .x are aliases for ..1, which is the first parameter passed to the function.
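
As a quick check of those aliases, here is a minimal sketch that uses only rlang::as_function. The names add_five, add_five_dot, and add_pair are made up for illustration:

```r
library(rlang)

# `.x` names the first argument of the lambda
add_five <- as_function(~ .x + 5)
add_five(1)
# [1] 6

# `.` is an alias for that same first argument
add_five_dot <- as_function(~ . + 5)
add_five_dot(1)
# [1] 6

# `.y` names the second argument
add_pair <- as_function(~ .x + .y)
add_pair(1, 2)
# [1] 3
```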

What is the difference between . and .data?

Up front, I think .data's intent is a little confusing until one also considers its sibling pronoun, .env.

The dot . is something that magrittr's %>% sets up and uses; since dplyr re-exports the pipe, it's available there too. Whenever you reference it, it is a real object, so names(.), nrow(.), etc. all work as expected. It reflects the data up to this point in the pipeline.

.data, on the other hand, is defined within rlang for the purpose of disambiguating symbol resolution. Along with .env, it allows you to be perfectly clear on where you want a particular symbol resolved (when ambiguity is expected). From ?.data, I think this is a clarifying contrast:

disp <- 10
mtcars %>% mutate(disp = .data$disp * .env$disp)
mtcars %>% mutate(disp = disp * disp)

However, as stated in the help pages, .data (and .env) is just a "pronoun" (we have verbs, so now we have pronouns too), so it is just a pointer to explain to the tidy internals where the symbol should be resolved. It's just a hint of sorts.

So your statement

both . and .data just mean "our result up to this point in the pipeline."

is not correct: . represents the data up to this point, .data is just a declarative hint to the internals.
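
To make the contrast concrete, here is a small sketch assuming dplyr is attached and using the built-in mtcars data, whose first disp value is 160. The names scaled and squared are illustrative only:

```r
library(dplyr)

disp <- 10

# `.` is the actual data frame flowing through the pipe
mtcars %>% { ncol(.) }
# [1] 11

# .data$disp is the column, .env$disp is the variable defined above (10)
scaled <- mtcars %>% mutate(disp = .data$disp * .env$disp)
scaled$disp[1]
# [1] 1600

# without pronouns both symbols resolve to the column, so this squares it
squared <- mtcars %>% mutate(disp = disp * disp)
squared$disp[1]
# [1] 25600
```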


Consider another way of thinking about .data: let's say we have two functions that completely disambiguate the environment a symbol is referenced against:

  • get_internally, this symbol must always reference a column name, it will not reach out to the enclosing environment if the column does not exist; and
  • get_externally, this symbol must always reference a variable/object in the enclosing environment, it will never match a column.

In that case, translating the above examples, one might use

disp <- 10
mtcars %>%
  mutate(disp = get_internally(disp) * get_externally(disp))

In that case, it seems more obvious that get_internally is not a frame, so you can't call names(get_internally) and expect it to do something meaningful (other than NULL). It'd be like names(mutate).

So don't think of .data as an object, think of it as a mechanism to disambiguate the environment of the symbol. I think the $ it uses is both terse/easy-to-use and absolutely-misleading: it is not a list-like or environment-like object, even if it is being treated as such.

BTW: one can write any S3 method for $ that makes any classed-object look like a frame/environment:

`$.quux` <- function(x, nm) paste0("hello, ", nm, "!")
obj <- structure(0, class = "quux")
obj$r2evans
# [1] "hello, r2evans!"
names(obj)
# NULL

(The presence of a $ accessor does not always mean the object is a frame/env.)

R combinations with dot ( . ), ~ , and pipe (%>%) operator

That line uses the . in three different ways.

         [1]             [2]      [3]
aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2))

Generally speaking, you pass the value from the pipe into your function at a specific location with ., but there are some exceptions. One exception is when the . is in a formula. The ~ is used to create formulas in R. The pipe won't change the meaning of the formula, so it behaves just as it would outside a pipe. For example:

aggregate(. ~ cyl, data=mydata)

Here aggregate requires a formula with both a left- and a right-hand side, so the . at [1] just means "all the other columns in the dataset." This use is not at all related to magrittr.

The . at [2] is the value that's being passed in by the pipe. If you have a plain . as a parameter to the function, that's where the value will be placed. So the result of the subset() will go to the data= parameter.

The magrittr library also allows you to define anonymous functions with the . variable. If you have a chain that starts with a ., it's treated like a function, so

. %>% mean %>% round(2)

is the same as

function(x) round(mean(x), 2)

so you're just creating a custom function with the . at [3].
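
Putting the three uses together, a sketch with the built-in mtcars data (standing in for whatever data frame is piped in) might look like this. round_mean, piped, and plain are just illustrative names:

```r
library(magrittr)

# [3] a chain that starts with `.` builds a function
round_mean <- . %>% mean %>% round(2)
round_mean(c(1, 2, 4))
# [1] 2.33

# [1] formula dot, [2] pipe placeholder, [3] functional sequence
piped <- mtcars %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2))

# the same call written without any pipe, for comparison
plain <- aggregate(. ~ cyl, data = mtcars,
                   FUN = function(x) round(mean(x), 2))

# compare the two results
all.equal(piped, plain)
```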

Taking the difference between ntiles and then bind_rows in dplyr pipe

This does not need any pre-processing.

library(dplyr)

df %>%
  group_by(date) %>%
  filter(ntile %in% c(1, 5)) %>%
  arrange(ntile) %>%
  summarise(ntile = paste(ntile[1], ntile[n()], sep = "-"),
            score = score[1] - score[n()]) %>%
  bind_rows({df %>% mutate(ntile = as.character(ntile))}, .) %>%
  select(date, ntile, score)
# # A tibble: 474 x 3
# date ntile score
# <date> <chr> <dbl>
# 1 2005-08-31 1 -2.39
# 2 2005-09-30 1 0.573
# 3 2005-10-31 1 -1.61
# 4 2005-11-30 1 5.43
# 5 2005-12-31 1 0.106
# 6 2006-01-31 1 6.66
# 7 2006-02-28 1 0.613
# 8 2006-03-31 1 4.21
# 9 2006-04-30 1 0.107
# 10 2006-05-31 1 -3.62
# # ... with 464 more rows

This is the tail of the data, showing the ntile 1 minus ntile 5 score differences (labeled "1-5") appended to the bottom:

.Last.value %>% tail %>% as.data.frame  

# date ntile score
# 1 2018-07-31 1-5 -0.278
# 2 2018-08-31 1-5 -2.01
# 3 2018-09-30 1-5 0.307
# 4 2018-10-31 1-5 -1.36
# 5 2018-11-30 1-5 -1.33
# 6 2018-12-31 1-5 -1.44
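
Since the original df is not shown, here is a toy run of the same pipeline on made-up data (two dates, five ntiles each; every value is invented for illustration), so the appended "1-5" rows are easy to verify by hand:

```r
library(dplyr)

# hypothetical data: 2 dates x 5 ntiles
df <- data.frame(
  date  = rep(as.Date(c("2020-01-31", "2020-02-29")), each = 5),
  ntile = rep(1:5, times = 2),
  score = c(5, 4, 3, 2, 1, 10, 8, 6, 4, 2)
)

result <- df %>%
  group_by(date) %>%
  filter(ntile %in% c(1, 5)) %>%
  arrange(ntile) %>%
  summarise(ntile = paste(ntile[1], ntile[n()], sep = "-"),
            score = score[1] - score[n()]) %>%
  bind_rows({df %>% mutate(ntile = as.character(ntile))}, .) %>%
  select(date, ntile, score)

# the last two rows hold the per-date differences: 5 - 1 and 10 - 2
tail(result, 2)
```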

What does the dplyr period character . reference?

The dot is used within dplyr mainly (not exclusively) in mutate_each, summarise_each (both since deprecated in newer dplyr releases) and do. In the first two (and their SE counterparts) it refers to all the columns to which the functions in funs are applied. In do it refers to the (potentially grouped) data.frame, so you can reference a single column named "xyz" with .$xyz.

The reason you cannot run

filter(df, . == 5)

is that a) filter is not designed to work with multiple columns the way mutate_each is, for example, and b) the . only exists inside a pipe, so you would need to use the pipe operator %>% (originally from magrittr).

However, you could use it with a function like rowSums inside filter when combined with the pipe operator %>%:

> filter(mtcars, rowSums(. > 5) > 4)
Error: object '.' not found

> mtcars %>% filter(rowSums(. > 5) > 4) %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4

You should also take a look at the magrittr help files:

library(magrittr)
help("%>%")

From the help page:

Placing lhs elsewhere in rhs call
Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).

Using the dot for secondary purposes
Often, some attribute or property of lhs is desired in the rhs call in addition to the value of lhs itself, e.g. the number of rows or columns. It is perfectly valid to use the dot placeholder several times in the rhs call, but by design the behavior is slightly different when using it inside nested function calls. In particular, if the placeholder is only used in a nested function call, lhs will also be placed as the first argument! The reason for this is that in most use-cases this produces the most readable code. For example, iris %>% subset(1:nrow(.) %% 2 == 0) is equivalent to iris %>% subset(., 1:nrow(.) %% 2 == 0) but slightly more compact. It is possible to overrule this behavior by enclosing the rhs in braces. For example, 1:10 %>% {c(min(.), max(.))} is equivalent to c(min(1:10), max(1:10)).
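
A few small checks of these placeholder rules, as a sketch using only magrittr and base R (even_rows is an illustrative name):

```r
library(magrittr)

# dot as a direct argument: lhs goes only where the dot is
2 %>% rep("a", .)
# [1] "a" "a"

# dot only inside a nested call: lhs is ALSO inserted as the first argument
even_rows <- iris %>% subset(1:nrow(.) %% 2 == 0)
nrow(even_rows)
# [1] 75

# so without braces the lhs is prepended to c() ...
1:10 %>% c(min(.), max(.))
#  [1]  1  2  3  4  5  6  7  8  9 10  1 10

# ... and braces switch that first-argument insertion off
1:10 %>% {c(min(.), max(.))}
# [1]  1 10
```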

What is the difference between %>% and %,% in magrittr?

The normal piping operator is %>%. You can use %,% to create a reusable pipe, a pipe without data. Then later you can use the same pipe with various data sets. Here is an example.

library(magrittr)
library(dplyr)
library(Lahman)

Suppose you want to find the top 5 baseball players by total games played (the G column). Then you can do something like this (taken from the magrittr README):

Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)
# Source: local data frame [5 x 2]
#
# playerID total
# 1 rosepe01 3562
# 2 yastrca01 3308
# 3 aaronha01 3298
# 4 henderi01 3081
# 5 cobbty01 3035

So far so good. Now let's assume that you have several data sets in the same format as Batting, so you could just reuse the same pipe again. %,% helps you create, save and reuse the pipe:

top_total <- group_by(playerID) %,%
  summarise(total = sum(G)) %,%
  arrange(desc(total)) %,%
  head(5)

top_total(Batting)
# Source: local data frame [5 x 2]
#
# playerID total
# 1 rosepe01 3562
# 2 yastrca01 3308
# 3 aaronha01 3298
# 4 henderi01 3081
# 5 cobbty01 3035

Of course you could also create a function the regular R way, i.e. top_total <- function(...) ..., but %,% is a more concise way.
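
Note that %,% comes from an early magrittr release; to my knowledge, current versions no longer export it. Assuming a recent magrittr, the same reusable pipe can be sketched as a chain that starts with . instead. The toy data frame below is made up and contains only the two Batting columns the pipe needs:

```r
library(magrittr)
library(dplyr)

# a pipe without data: starting the chain with `.` yields a function
top_total <- . %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)

# hypothetical stand-in for a Batting-shaped data set
toy <- data.frame(
  playerID = c("a", "b", "a", "c", "b", "a"),
  G        = c(10, 20, 30, 5, 1, 2)
)

# totals: a = 42, b = 21, c = 5
top_total(toy)
```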

Chain arithmetic operators in dplyr with %>% pipe

Surround the operators with backticks or quotes, and things should work as expected:

1:10 %>% `*`(2) %>% sum
# [1] 110

1:10 %>% `/`(2) %>% sum
# [1] 27.5
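
If the backticked operators feel hard to read, two alternatives can be sketched: magrittr's arithmetic aliases (multiply_by, divide_by) and a braced expression that uses the dot explicitly.

```r
library(magrittr)

# alias functions exported by magrittr
1:10 %>% multiply_by(2) %>% sum
# [1] 110

1:10 %>% divide_by(2) %>% sum
# [1] 27.5

# braces plus the dot placeholder
1:10 %>% { . * 2 } %>% sum
# [1] 110
```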

Creating new variable using piping in R

For most case counts, no subset can be exactly 32% of the total. For instance, we would report 29 of 90 cases as "32%", but that is really 32.222222%, which is not strictly equal to 32. So you need to specify what range around 32 counts as a match. Here, I say anything within 0.5 of 32 on either side, from 31.5 to 32.5, is close enough.

COVID <- COVID %>%
  mutate(confirmed_delta_perc = active_delta / active * 100) %>%
  filter(abs(confirmed_delta_perc - 32) <= 0.5)
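
To see the tolerance at work, here is a toy run on made-up numbers. The data frame below is hypothetical and has only the two columns the pipeline uses; matches is an illustrative name:

```r
library(dplyr)

# hypothetical data
COVID <- data.frame(
  active       = c(90, 100, 200),
  active_delta = c(29, 32, 10)
)

# an exact comparison misses the 29-of-90 row
29 / 90 * 100 == 32
# [1] FALSE

matches <- COVID %>%
  mutate(confirmed_delta_perc = active_delta / active * 100) %>%
  filter(abs(confirmed_delta_perc - 32) <= 0.5)

# keeps the 29/90 and 32/100 rows, drops 10/200
nrow(matches)
# [1] 2
```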

