How Exactly Does R Parse '->', the Right-Assignment Operator

How exactly does R parse `->`, the right-assignment operator?

Let me preface this by saying I know absolutely nothing about how parsers work. Having said that, line 296 of gram.y defines the following tokens to represent assignment in the (YACC?) parser R uses:

%token      LEFT_ASSIGN EQ_ASSIGN RIGHT_ASSIGN LBB

Then, on lines 5140 through 5150 of gram.c, this looks like the corresponding C code:

case '-':
  if (nextchar('>')) {
    if (nextchar('>')) {
      yylval = install_and_save2("<<-", "->>");
      return RIGHT_ASSIGN;
    }
    else {
      yylval = install_and_save2("<-", "->");
      return RIGHT_ASSIGN;
    }
  }

Finally, starting on line 5044 of gram.c, the definition of install_and_save2:

/* Get an R symbol, and set different yytext.  Used for translation of -> to <-. ->> to <<- */
static SEXP install_and_save2(char * text, char * savetext)
{
    strcpy(yytext, savetext);
    return install(text);
}

So again, having zero experience working with parsers, it seems that -> and ->> are translated directly into <- and <<-, respectively, at a very low level in the interpretation process.

You brought up a very good point in asking how the parser "knows" to reverse the arguments to ->, considering that -> appears to be installed into the R symbol table as <-, so that it correctly interprets x -> y as y <- x and not x <- y. The best I can do is provide further speculation as I continue to come across "evidence" to support my claims. Hopefully some merciful YACC expert will stumble on this question and provide a little insight; I'm not going to hold my breath on that, though.

Back to lines 383 and 384 of gram.y, this looks like some more parsing logic related to the aforementioned LEFT_ASSIGN and RIGHT_ASSIGN symbols:

|   expr LEFT_ASSIGN expr       { $$ = xxbinary($2,$1,$3);  setId( $$, @$); }
|   expr RIGHT_ASSIGN expr      { $$ = xxbinary($2,$3,$1);  setId( $$, @$); }

Although I can't really make heads or tails of this crazy syntax, I did notice that the second and third arguments to xxbinary are swapped between the LEFT_ASSIGN rule (xxbinary($2,$1,$3)) and the RIGHT_ASSIGN rule (xxbinary($2,$3,$1)).

Here's what I'm picturing in my head:

LEFT_ASSIGN Scenario: y <- x

$2 is the second "argument" to the parser in the above expression, i.e. <-
$1 is the first; namely y
$3 is the third; x

Therefore, the resulting (C?) call would be xxbinary(<-, y, x).

Applying this logic to RIGHT_ASSIGN, i.e. x -> y, combined with my earlier conjecture about <- and -> getting swapped,

$2 gets translated from -> to <-
$1 is x
$3 is y

But since the result is xxbinary($2,$3,$1) instead of xxbinary($2,$1,$3), the result is still xxbinary(<-, y, x).
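The same conclusion can be checked from the R side without reading any C. This is only a small sketch; the variable name parts is mine and not from the R sources. It takes the parsed call for x -> y apart and looks at the three pieces that would correspond to the arguments of xxbinary:

```r
# Split the parsed call into its components: function symbol, then arguments.
parts <- as.list(quote(x -> y))

parts[[1]]  # the function symbol: `<-`, not `->`
parts[[2]]  # first argument: y (the assignment target)
parts[[3]]  # second argument: x (the value)
```

If the reading of gram.y above is right, the order should be `<-`, y, x, which matches xxbinary(<-, y, x).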

Building off of this a little further, we have the definition of xxbinary on line 3310 of gram.c:

static SEXP xxbinary(SEXP n1, SEXP n2, SEXP n3)
{
    SEXP ans;
    if (GenerateCode)
        PROTECT(ans = lang3(n1, n2, n3));
    else
        PROTECT(ans = R_NilValue);
    UNPROTECT_PTR(n2);
    UNPROTECT_PTR(n3);
    return ans;
}

Unfortunately I could not find a proper definition of lang3 (or its variants lang1, lang2, etc.) in the R source code, but judging by how it is used here, it builds a three-element language object (a call made of the function symbol and its two arguments) rather than evaluating anything.
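As a rough R-level analogue of what lang3 seems to be doing here (an assumption on my part, not something taken from the C sources), one can build the three-element call by hand with as.call and compare it with what the parser produces. The name built is just for illustration:

```r
# Assemble a call from a function symbol and two argument symbols,
# in the order xxbinary passes them for a right assignment: `<-`, y, x.
built <- as.call(list(as.name("<-"), quote(y), quote(x)))

built                              # prints as: y <- x
identical(built, quote(x -> y))    # same object the parser produces
```

Nothing is evaluated at this point; the result is just an unevaluated call object.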

Updates
I'll try to address some of your additional questions in the comments as best I can given my (very) limited knowledge of the parsing process.

1) Is this really the only object in R that behaves like this?? (I've
got in mind the John Chambers quote via Hadley's book: "Everything
that exists is an object. Everything that happens is a function call."
This clearly lies outside that domain -- is there anything else like
this?)

First, I agree that this lies outside of that domain. I believe Chambers' quote concerns the R Environment, i.e. processes that are all taking place after this low level parsing phase. I'll touch on this a little bit more below, however. Anyways, the only other example of this sort of behavior I could find is the ** operator, which is a synonym for the more common exponentiation operator ^. As with right assignment, ** doesn't seem to be "recognized" as a function call, etc... by the interpreter:

R> `->`
#Error: object '->' not found
R> `**`
#Error: object '**' not found

I found this because it's the only other case where install_and_save2 is used by the C parser:

case '*':
  /* Replace ** by ^.  This has been here since 1998, but is
     undocumented (at least in the obvious places).  It is in
     the index of the Blue Book with a reference to p. 431, the
     help for 'Deprecated'.  S-PLUS 6.2 still allowed this, so
     presumably it was for compatibility with S. */
  if (nextchar('*')) {
    yylval = install_and_save2("^", "**");
    return '^';
  } else
    yylval = install_and_save("*");
return c;
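A quick way to see the ** translation from within R (a small sketch, nothing more; the variable names are mine) is to compare the parsed calls and look at how the call deparses:

```r
# The parser turns ** into ^ before any R code sees the expression.
same_call <- identical(quote(2 ** 3), quote(2 ^ 3))
as_text   <- deparse(quote(2 ** 3))   # the ** spelling is gone
value     <- 2 ** 3                   # evaluated with the ordinary ^ function

same_call
as_text
value
```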

2) When exactly does this happen? I've got in mind that substitute(3
-> y) has already flipped the expression; I couldn't figure out from the source what substitute does that would have pinged the YACC...

Of course I'm still speculating here, but yes, I think we can safely assume that when you call substitute(3 -> y), from the perspective of the substitute function, the expression always was y <- 3; i.e. the function is completely unaware that you typed 3 -> y. do_substitute, like 99% of the C functions used by R, only handles SEXP arguments: a LANGSXP (a call object) in the case of 3 -> y (== y <- 3), I believe. This is what I was alluding to above when I made a distinction between the R Environment and the parsing process.

I don't think there is anything that specifically triggers the parser to spring into action; rather, everything you input into the interpreter gets parsed. I did a little more reading about the YACC / Bison parser generator last night, and as I understand it (a.k.a. don't bet the farm on this), Bison uses the grammar you define (in the .y file(s)) to generate a parser in C, i.e. a C function which does the actual parsing of input. In turn, everything you input in an R session is first processed by this C parsing function, which then delegates the appropriate action to be taken in the R Environment (I'm using this term very loosely by the way).

During this phase, lhs -> rhs will get translated to rhs <- lhs, ** to ^, etc. For example, this is an excerpt from one of the tables of primitive functions in names.c:

/* Language Related Constructs */

/* Primitives */
{"if",      do_if,      0,  200,    -1, {PP_IF,      PREC_FN,     1}},
{"while",   do_while,   0,  100,    2,  {PP_WHILE,   PREC_FN,     0}},
{"for",     do_for,     0,  100,    3,  {PP_FOR,     PREC_FN,     0}},
{"repeat",  do_repeat,  0,  100,    1,  {PP_REPEAT,  PREC_FN,     0}},
{"break",   do_break, CTXT_BREAK,   0,  0,  {PP_BREAK,   PREC_FN,     0}},
{"next",    do_break, CTXT_NEXT,    0,  0,  {PP_NEXT,    PREC_FN,     0}},
{"return",  do_return,  0,  0,  -1, {PP_RETURN,  PREC_FN,     0}},
{"function",    do_function,    0,  0,  -1, {PP_FUNCTION,PREC_FN,     0}},
{"<-",      do_set,     1,  100,    -1, {PP_ASSIGN,  PREC_LEFT,   1}},
{"=",       do_set,     3,  100,    -1, {PP_ASSIGN,  PREC_EQ,     1}},
{"<<-",     do_set,     2,  100,    -1, {PP_ASSIGN2, PREC_LEFT,   1}},
{"{",       do_begin,   0,  200,    -1, {PP_CURLY,   PREC_FN,     0}},
{"(",       do_paren,   0,  1,  1,  {PP_PAREN,   PREC_FN,     0}},

You will notice that ->, ->>, and ** are not defined here. As far as I know, R primitive expressions such as <- and [ are the closest interaction the R Environment ever has with any underlying C code. What I am suggesting is that by this stage in the process (from you typing a set of characters into the interpreter and hitting 'Enter', up through the actual evaluation of a valid R expression), the parser has already worked its magic, which is why you can't get a function definition for -> or ** by surrounding them with backticks, as you typically can.
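Here is a small check of that claim from the R prompt. It is a sketch that assumes a session in which nobody has defined `->`, `->>` or `**` themselves; show_arg is a hypothetical helper I made up to display what a function actually receives:

```r
# The left-assignment operators exist as primitive functions ...
exists("<-")          # TRUE
is.primitive(`<-`)    # TRUE

# ... but the right-assignment and ** spellings have no function behind them.
exists("->")          # FALSE, assuming nothing in the session defines it
exists("->>")         # FALSE
exists("**")          # FALSE

# What does a function see when it is handed 3 -> y?
show_arg <- function(expr) deparse(substitute(expr))
seen <- show_arg(3 -> y)
seen                  # the already-flipped expression
```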

Parsing "->" assignment operator in R

can I be sure that I always get "<-" operator in syntax tree

Let’s see …

> quote(b -> a)
a <- b
> identical(quote(b -> a), quote(a <- b))
[1] TRUE

So yes, the -> assignment is always parsed as <- (the same is not true when invoking -> as a function name!¹).

Your first display is the other way round because of parse’s keep.source argument:

> parse(text = 'b -> a')
expression(b -> a)
> parse(text = 'b -> a', keep.source = FALSE)
expression(a <- b)
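To make the keep.source effect a bit more concrete, here is a minimal sketch (the variable names are mine) that separates the stored source text from the parsed call. The original spelling only survives in the srcref attribute; the call itself is already a <- call:

```r
p <- parse(text = "b -> a", keep.source = TRUE)

# The source reference remembers what was typed ...
typed <- as.character(attr(p, "srcref")[[1]])

# ... while the parsed call is the flipped assignment.
parsed_text <- deparse(p[[1]])
fun_symbol  <- p[[1]][[1]]

typed
parsed_text
fun_symbol
```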

¹ Invoking <- as a function is the same as using it as an operator:

> quote(`<-`(a, b))
a <- b
> identical(quote(a <- b), quote(`<-`(a, b)))
[1] TRUE

However, there is no -> function (although you can define one), and writing b -> a never calls a -> function, it always gets parsed as a <- b, which, in turn, invokes the <- function or primitive.
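A minimal sketch of that last point: even if a function named `->` is defined, right assignment does not go through it. Everything is wrapped in local() so the made-up definition does not leak into the session:

```r
res <- local({
  # A deliberately broken, purely illustrative `->` function.
  `->` <- function(lhs, rhs) stop("the `->` function was called")

  5 -> a   # parsed as a <- 5, so the function above is never used
  a
})

res
```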

What are the differences between = and <- assignment operators?

What are the differences between the assignment operators = and <- in R?

As your example shows, = and <- have slightly different operator precedence (which determines the order of evaluation when they are mixed in the same expression). In fact, ?Syntax in R gives the following operator precedence table, from highest to lowest:

…
‘-> ->>’           rightwards assignment
‘<- <<-’           assignment (right to left)
‘=’                assignment (right to left)
…
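The effect of this precedence difference can be made visible by parsing (not evaluating) two mixed expressions and looking at which operator ends up at the top of the parse tree. This is just a sketch with variable names of my choosing:

```r
# `<-` binds tighter than `=`, so `=` is always the outermost call here.
e1 <- parse(text = "x = y <- 5")[[1]]
e2 <- parse(text = "x <- y = 5")[[1]]

e1[[1]]; e1[[3]]   # `=` applied to x and the sub-call y <- 5
e2[[1]]; e2[[2]]   # `=` applied to the sub-call x <- y and 5
```

The first form evaluates without trouble, while the second one parses but fails when evaluated, because its assignment target is the call x <- y rather than a name.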

But is this the only difference?

Since you were asking about the assignment operators: yes, that is the only difference. However, you would be forgiven for believing otherwise. Even the R documentation of ?assignOps claims that there are more differences:

The operator <- can be used anywhere,
whereas the operator = is only allowed at the top level (e.g.,
in the complete expression typed at the command prompt) or as one
of the subexpressions in a braced list of expressions.

Let’s not put too fine a point on it: the R documentation is wrong. This is easy to show: we just need to find a counter-example of the = operator that isn’t (a) at the top level, nor (b) a subexpression in a braced list of expressions (i.e. {…; …}). — Without further ado:

x
# Error: object 'x' not found
sum((x = 1), 2)
# [1] 3
x
# [1] 1

Clearly we’ve performed an assignment, using =, outside of contexts (a) and (b). So, why has the documentation of a core R language feature been wrong for decades?

It’s because in R’s syntax the symbol = has two distinct meanings that get routinely conflated (even by experts, including in the documentation cited above):

The first meaning is as an assignment operator. This is all we’ve talked about so far.
The second meaning isn’t an operator but rather a syntax token that signals named argument passing in a function call. Unlike the = operator it performs no action at runtime, it merely changes the way an expression is parsed.

So how does R decide whether a given usage of = refers to the operator or to named argument passing? Let’s see.

In any piece of code of the general form …

‹function_name›(‹argname› = ‹value›, …)
‹function_name›(‹args›, ‹argname› = ‹value›, …)

… the = is the token that defines named argument passing: it is not the assignment operator. Furthermore, = is entirely forbidden in some syntactic contexts:

if (‹var› = ‹value›) …
while (‹var› = ‹value›) …
for (‹var› = ‹value› in ‹value2›) …
for (‹var1› in ‹var2› = ‹value›) …

Any of these will raise an error “unexpected '=' in ‹bla›”.

In any other context, = refers to the assignment operator call. In particular, merely putting parentheses around the subexpression makes any of the above (a) valid, and (b) an assignment. For instance, the following performs assignment:

median((x = 1 : 10))

But also:

if (! (nf = length(from))) return()

Now you might object that such code is atrocious (and you may be right). But I took this code from the base::file.copy function (replacing <- with =); it's a pervasive pattern in much of the core R codebase.
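The three cases above can be told apart with a short experiment. This is only a sketch: env is a scratch environment I introduce so that any variables created are easy to detect, and identity() merely stands in for an arbitrary function with an argument named x.

```r
env <- new.env()

# 1) Inside a call, `=` only names the argument: no variable x is created.
eval(quote(identity(x = 1)), env)
created_by_named_arg <- exists("x", envir = env, inherits = FALSE)

# 2) With an extra pair of parentheses it is the assignment operator.
eval(quote(identity((x = 1))), env)
created_by_parens <- exists("x", envir = env, inherits = FALSE)

# 3) In an `if` condition, a bare `=` is rejected by the parser itself.
parse_failed <- tryCatch({
  parse(text = "if (x = 1) 1")
  FALSE
}, error = function(e) TRUE)

c(created_by_named_arg, created_by_parens, parse_failed)
```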

The original explanation by John Chambers, which the R documentation is probably based on, actually explains this correctly:

[= assignment is] allowed in only two places in the grammar: at the top level (as a complete program or user-typed expression); and when isolated from surrounding logical structure, by braces or an extra pair of parentheses.

In sum, by default the operators <- and = do the same thing. But either of them can be overridden separately to change its behaviour. By contrast, <- and -> (left-to-right assignment), though syntactically distinct, always call the same function. Overriding one also overrides the other. Knowing this is rarely practical but it can be used for some fun shenanigans.
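As a small illustration of the last point, the sketch below overrides `<-` inside local() (so the override disappears afterwards) and shows that right assignment is affected too, while = is not. The replacement function and the names in the result are invented for this example:

```r
res <- local({
  # Override `<-` only; note that the override itself is installed with `=`.
  `<-` = function(lhs, rhs) {
    paste("intercepted assignment to", deparse(substitute(lhs)))
  }

  right_result = (1 -> a)   # parsed as a <- 1, so it hits the override
  b = 2                     # `=` still performs a normal assignment

  list(right = right_result, b = b)
})

res$right
res$b
```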

The Assignment Operator in R: Does <- work all the time, in comparison to =?

R has several assignment operators. Per the documentation:

The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

The only place I am aware of where the two are not interchangeable is naming the items of a list, for example in a call to attach: there you must use =, not <-.

This does not work:

> attach(what <- list(foo <- function(x) print(x)))

but this does:

> attach(what <- list(foo = function(x) print(x)))

I don't actually know why this is. If anyone else knows I'd love to learn why.
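One plausible explanation, consistent with the distinction between the = operator and named argument passing drawn elsewhere in this thread, is that inside list(...) the = names an element, whereas <- performs an ordinary assignment and passes the value along unnamed. A small sketch (variable names are mine) that avoids touching the search path:

```r
# `<-` inside the call assigns foo in the calling environment
# and hands list() an unnamed element.
unnamed <- list(foo <- function(x) print(x))

# `=` inside the call names the element instead.
named <- list(foo = function(x) print(x))

names(unnamed)   # NULL
names(named)     # "foo"
```

Since attach works with the names of the list it is given, the unnamed version would have nothing to attach under the name foo.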

I am also compelled to discourage use of any of the blasphemous right assignment operators.

R assignment operators

When the R parser comes across a -> b it produces the call `<-`(b, a), and when it comes across a ->> b it produces the call `<<-`(b, a); in both cases the target b is stored as a symbol, not as the string "b".

We can see this explicitly if we do the following:

as.call(quote(a <- 1))
#> a <- 1
as.call(quote(a <<- 1))
#> a <<- 1
as.call(quote(1 -> a))
#> a <- 1
as.call(quote(1 ->> a))
#> a <<- 1

Created on 2022-01-27 by the reprex package (v2.0.1)

Bidirectional assignment operator in R

One can define something like:

`%<->%` <- function(x,y){
    t <- y
    assign(deparse(substitute(y)), x, envir=parent.frame())
    assign(deparse(substitute(x)), t, envir=parent.frame()) 
}
a <- 1
b <- 2
a %<->% b
a
[1] 2
b
[1] 1

Parsing = operator in R does not yield a language object

You need to understand that typeof returns a fairly low level characterization and that is( ... , "language") tests a somewhat higher level of abstraction. There is not much use for typeof. It's generally more useful to ask for the class of an object:

> class(parsed)
[1] "expression"
> class(parsed[[1]])
[1] "="

This second one might seem a bit odd, and I would have thought it to be either a call or an Ops result, but if you look at:

parsed[[1]]
#cylinders = c(4, 6, 8)

You see that the call object is represented internally (i.e. in the parse tree) as:

`=`( cylinders, c(4, 6, 8) )

... noting that:

 parsed[[1]][[1]]
`=`    # note the backticks signifying a function, a language object

... and that this is really a call-object:

  is.call( parsed[[1]] )
 #[1] TRUE

See ?parse, which explains that the function returns the parsed but unevaluated expressions in an expression object, essentially a list of calls. I'm more of an S3 guy, so trying to explain what's going wrong with your S4 stuff is above my pay grade. Notice that the error message from your failed S4 efforts referred to a mismatch of 'class' rather than 'typeof'.
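For reference, the checks discussed above can be collected into one runnable sketch. The definition of parsed is an assumption on my part, reconstructed from the printed output; the question's own code may have built it differently:

```r
# Assumed reconstruction of the object under discussion.
parsed <- parse(text = "cylinders = c(4, 6, 8)")

typeof(parsed)          # low-level type of the container
class(parsed[[1]])      # class of the single parsed element
typeof(parsed[[1]])     # low-level type of that element
is.call(parsed[[1]])    # it is a call object
parsed[[1]][[1]]        # the function symbol of the call
```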
