Why Can't C++ Be Parsed With a Lr(1) Parser

Why can't C++ be parsed with a LR(1) parser?

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous,
context-dependent and potentially
requires infinite lookahead to resolve
some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

How does gcc/clang parse c++ if it can't be parsed by a LR(1) parser?

GCC/Clang are not strict recursive descent parsers; they allow backtracking (reparsing an arbitrarily-long text), among other deviations from theoretical purity. Backtracking parsers are strictly more powerful than non-backtracking parsers (but at the cost of no longer being linear time).

The real complexities in parsing C++ vanish when the question is phrased this way. If you strip C++ down to the superset described by the BNF in Appendix A, then you basically just need to be prepared to consider several alternative parses. You can do that through backtracking -- trying one possibility and then if that fails, trying some other ones -- or with parallel exploration, using GLR/GLL or some other variant; nothing too painful. But the real work still needs to be confronted:

the preprocessor, which is not particularly complicated but cannot be described by anything remotely resembling a formal parsing framework;
template instantiation, which intermingles semantic analysis into the parsing process (and needs to be done in order to discover the correct parse);
name resolution, which some might not consider to be part of parsing, but you're not going to move on to the next step until you know which syntactic object a particular identifier refers to. (If you think name resolution is simple, reread the 13 dense pages of the C++ standard in section 6.5 (Name lookup), and then move on to the 35 pages on resolving overloaded names, section 12.)

Can an LR(1) parser parse this grammar?

No, an LR(1) parser cannot parse this grammar. LR(k) parsers can only parse unambiguous grammars, and this grammar is ambiguous (you can derive ε in infinitely many ways).

You can check this by building the configurating sets for the grammar, though that's going to be pretty boring. :-)

Hope this helps!

LR(1) parser state size still an issue?

The main concern with LR(1) parsers is the table size, and that table size is going to hurt in one way or another.

If you have an LR(1) parser with 10,000,000 states (not all that uncommon) where there are, say, 50 nonterminals and 50 terminals (not all that unreasonable), you will have a table with one billion entries in it. If you use even one byte per entry, you now need 1GB of space just to hold the table. That space either is in the application binary, in which case you now have a 1GB executable, or it's generated dynamically, in which case you now need 1GB of RAM plus the time to populate it. Neither of these are very attractive.

You absolutely could use an LR(1) parser if you have that kind of memory, but it wouldn't be a good idea. First, the size of the application binary would be enormous. This would make it difficult to distribute the application. Second, the act of loading the table into memory would require a transfer of about 1GB of data from disk into RAM, which would be extraordinarily slow. There's also the issue of paging in and out the parsing tables. If the OS doesn't do a good job evicting pages, you could end up thrashing, degrading performance unacceptably.

While you could put the parser on a server, this typically isn't done right now and would require that all compilation be done over a network.

There's also the question of whether it's worth it. The huge spike in resource costs from the parser would need to be justified by some proportional benefit in parsing quality. In practice, LALR parsers would work for many grammars. For those that it doesn't work for, newer parsing algorithms like IELR or GLR would be a superior choice to LR(1) because they offer the same parsing power (or more in the case of GLR) with significant space reductions. Consequently, you'd be better off using those algorithms.

In summary, yes, you could use LR(1) today, but it would be so resource inefficient that you'd be better off with another parsing algorithm.

Hope this helps!

How to resolve ambiguity in the definition of an LR(1) grammar?

As written, the go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCAML, but I don't know enough about any of them to recommend one).

If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.

Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.

What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.

It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.

Which grammars can be parsed using recursive descent without backtracking?

TL;DR

For a suitable definition of what a recursive descent parser it, it is absolutely correct that only LL(k) languages can be parsed by recursive descent.

Lua can be parsed with a recursive descent parser precisely because the language is LL(k); that is, an LL(k) grammar exists for Lua. [Note 1]

1. An LL(k) language may have non-LL(k) grammars.

A language is LL(k) if there is an LL(k) grammar which recognizes the language. That doesn't mean that every grammar which recognizes the language is LL(k); there might be any number of non-LL(k) grammars which recognize the language. So the fact that some grammar is not LL(k) says absolutely nothing about the language itself.

2. Many practical programming languages are described with an ambiguous grammar.

In formal language theory, a language is inherently ambiguous only if every grammar for the language is ambiguous. It is probably safe to say that no practical programming language is inherently ambiguous, since practical programming languages are deterministically parsed (somehow). [Note 2].

Because writing a strictly non-ambiguous grammar can be tedious, it is pretty common for the language documentation to provide an ambiguous grammar, along with textual material which indicates how the ambiguities are to be resolved.

For example, many languages (including Lua) are documented with a grammar which does not explicitly include operator precedence, allowing a simple rule for expressions:

exp ::= exp Binop exp | Unop exp | term

That rule is clearly ambiguous, but given a list of operators, their relative precedences and an indication of whether each operator is left- or right-associative, the rule can be mechanically expanded into an unambiguous expression grammar. Indeed, many parser generators allow the user to provide the precedence declarations separately, and perform the mechanical expansion in the course of producing the parser. The resulting parser, it should be noted, is a parser for the disambiguated grammar so the ambiguity of the original grammar does not imply that the parsing algorithm is capable of dealing with ambiguous grammars.

Another common example of ambiguous reference grammars which can be mechanically disambiguated is the "dangling else" ambiguity found in languages like C (but not in Lua). The grammar:

if-statement ::= "if" '(' exp ')' stmt
               | "if" '(' exp ')' stmt "else" stmt

is certainly ambiguous; the intention is that the parse be "greedy". Again, the ambiguity is not inherent. There is a mechanical transformation which produces an unambiguous grammar, something like the following:

matched-statement ::= matched-if-stmt | other-statement
statement         ::= matched-if-stmt | unmatched-if-stmt
matched-if-stmt   ::= "if" '(' exp ')' matched-statement "else" matched-statement 
unmatched-if-stmt ::= "if" '(' exp ')' statement
                    | "if" '(' exp ')' matched-statement "else" unmatched-if-stmt

It is quite common for parser generators to implicitly perform this transformation. (For an LR parser generator, the transformation is actually implemented by deleting reduce actions if they conflict with a shift action. This is simpler than transforming the grammar, but it has exactly the same effect.)

So Lua (and other programming languages) are not inherently ambiguous; and therefore they can be parsed with parsing algorithms which require unambiguous deterministic parsers. Indeed, it might even be a little surprising that there are languages for which every possible grammar is ambiguous. As is pointed out in the Wikipedia article cited above, the existence of such languages was proven by Rohit Parikh in 1961; a simple example of an inherently-ambiguous context-free language is

{aⁿb^mc^mdⁿ|n,m≥0} ∪ {aⁿbⁿc^md^m|n,m≥0}.

3. Greedy LL(1) parsing of Lua assignment and function call statements

As with the dangling else construction above, the disambiguation of Lua statement sequences is performed by only allowing the greedy parse. Intuitively, the procedure is straight-forward; it is based on forbidding two consecutive statements (without intervening semicolon) where the second one starts with a token which might continue the first one.

In practice, it is not really necessary to perform this transformation; it can be done implicitly during the construction of the parser. So I'm not going to bother to generate a complete Lua grammar here. But I trust that the small subset of the Lua grammar here is sufficient to illustrate how the transformation can work.

The following subset (largely based on the reference grammar) exhibits precisely the ambiguity indicated in the OP:

program        ::= statement-list
statement-list ::= Ø
                 | statement-list statement
statement      ::= assignment | function-call | block | ';'
block          ::= "do" statement-list "end"
assignment     ::= var '=' exp
exp            ::= prefixexp                          [Note 3]
prefixexp      ::= var | '(' exp ')' | function-call
var            ::= Name | prefixexp '[' exp ']'
function-call  ::= prefixexp '(' exp ')'

(Note: (I'm using Ø to represent the empty string, rather ε, λ, or %empty.)

The Lua grammar as is left-recursive, so it is clearly not LL(k) (independent of the ambiguity). Removing the left-recursion can be done mechanically; I've done enough of it here in order to demonstrate that the subset is LL(1). Unfortunately, the transformed grammar does not preserve the structure of the parse tree, which is a classic problem with LL(k) grammars. It is usually simple to reconstruct the correct parse tree during a recursive descent parse and I'm not going to go into the details.

It is simple to provide an LL(1) version of exp, but the result eliminates the distinction between var (which can be assigned to) and function-call (which cannot):

exp          ::= term exp-postfix
exp-postfix  ::= Ø
               | '[' exp ']' exp-postfix
               | '(' exp ')' exp-postfix 
term         ::= Name | '(' exp ')'

But now we need to recreate the distinction in order to be able to parse both assignment statements and function calls. That's straight-forward (but does not promote understanding of the syntax, IMHO):

a-or-fc-statement ::= term a-postfix
a-postfix         ::= '=' exp 
                    | ac-postfix
c-postfix         ::= Ø
                    | ac-postfix
ac-postfix        ::= '(' exp ')' c-postfix
                    | '[' exp ']' a-postfix

In order to make the greedy parse unambiguous, we need to ban (from the grammar) any occurrence of S₁ S₂ where S₁ ends with an exp and S₂ starts with a '('. In effect, we need to distinguish different types of statement, depending on whether or not the statement starts with a (, and independently, whether or not the statement ends with an exp. (In practice, there are only three types because there are no statements which start with a ( and do not end with an exp. [Note 4])

statement-list ::= Ø
                 | s1 statement-list
                 | s2 s2-postfix
                 | s3 s2-postfix
s2-postfix     ::= Ø
                 | s1 statement-list
                 | s2 s2-postfix
s1             ::= block | ';'
s2             ::= Name a-postfix
s3             ::= '(' exp ')' a-postfix

4. What is recursive descent parsing, and how can it be modified to incorporate disambiguation?

In the most common usage, a predictive recursive descent parser is an implementation of the LL(k) algorithm in which each non-terminal is mapped to a procedure. Each non-terminal procedure starts by using a table of possible lookahead sequences of length k to decide which alternative production for that non-terminal to use, and then simply "executes" the production symbol by symbol: terminal symbols cause the next input symbol to be discarded if it matches or an error to be reported if it doesn't match; non-terminal symbols cause the non-terminal procedure to be called.

The tables of lookahead sequences can be constructed using FIRST_k and FOLLOW_k sets. (A production A→ω is mapped to a sequence α of terminals if α ∈ FIRST_k(ω FOLLOW_k(A)).) [Note 5]

With this definition of recursive descent parsing, a recursive descent parser can handle precisely and solely LL(k) languages. [Note 6]

However, the alignment of LL(k) and recursive descent parsers ignores an important aspect of a recursive descent parser, which is that it is, first and foremost, a program normally written in some Turing-complete programming language. If that program is allowed to deviate slightly from the rigid structure described above, it could parse a much larger set of languages, even languages which are not context-free. (See, for example, the C context-sensitivity referenced in Note 2.)

In particular, it is very easy to add a "default" rule to a table mapping lookaheads to productions. This is a very tempting optimization because it considerably reduces the size of the lookahead table. Commonly, the default rule is used for non-terminals whose alternatives include an empty right-hand side, which in the case of an LL(1) grammar would be mapped to any symbol in the FOLLOW set for the non-terminal. In that implementation, the lookahead table only includes lookaheads from the FIRST set, and the parser automatically produces an empty right-hand side, corresponding to an immediate return, for any other symbol. (As with the similar optimisation in LR(k) parsers, this optimization can delay recognition of errors but they are still recognized before an additional token is read.)

An LL(1) parser cannot include a nullable non-terminal whose FIRST and FOLLOW sets contain a common element. However, if the recursive descent parser uses the "default rule" optimization, that conflict will never be noticed during the construction of the parser. In effect, ignoring the conflict allows the construction of a "greedy" parser from (certain) non-deterministic grammars.

That's enormously convenient, because as we have seen above producing unambiguous greedy grammars is a lot of work and does not lead to anything even vaguely resembling a clear exposition of the language. But the modified recursive parsing algorithm is not more powerful; it simply parses an equivalent SLL(k) grammar (without actually constructing that grammar).

I do not intend to provide a complete proof of the above assertion, but a first step is to observe that any non-terminal can be rewritten as a disjunction of new non-terminals, each with a single distinct FIRST token, and possibly a new non-terminal with an empty right-hand side. It is then "only" necessary to remove non-terminals from the FOLLOW set of nullable non-terminals by creating new disjunctions.

Notes

Here, I'm talking about the grammar which operates on a tokenized stream, in which comments have been removed and other constructs (such as strings delimited by "long brackets") reduced to a single token. Without this transformation, the language would not be LL(k) (since comments -- which can be arbitrarily long -- interfere with visibility of the lookahead token). This allows me to also sidestep the question of how long brackets can be recognised with an LL(k) grammar, which is not particularly relevant to this question.
There are programming languages which cannot be deterministically parsed by a context-free grammar. The most notorious example is probably Perl, but there is also the well-known C construct (x)*y which can only be parsed deterministically using information about the symbol x -- whether it is a variable name or a type alias -- and the difficulties of correctly parsing C++ expressions involving templates. (See, for example, the questions Why can't C++ be parsed with a LR(1) parser? and Is C++ context-free or context-sensitive?)
For simplicity, I've removed the various literal constants (strings, numbers, booleans, etc.) as well as table constructors and function definitions. These tokens cannot be the target of a function-call, which means that an expression ending with one of these tokens cannot be extended with a parenthesized expression. Removing them simplifies the illustration of disambiguation; the procedure is still possible with the full grammar, but it is even more tedious.
With the full grammar, we will need to also consider expressions which cannot be extended with a (, so there will be four distinct options.
There are deterministic LL(k) grammars which fail to produce unambiguous parsing tables using this algorithm, which Sippu & Soisalon-Soininen call the Strong LL(k) algorithm. It is possible to augment the algorithm using an additional parsing state, similar to the state in an LR(k) parser. This might be convenient for particular grammars but it does not change the definition of LL(k) languages. As Sippu & Soisalon-Soininen demonstrate, it is possible to mechanically derive from any LL(k) grammar an SLL(k) grammar which produces exactly the same language. (See Theorem 8.47 in Volume 2).
The recursive definition algorithm is a precise implementation of the canonical stack-based LL(k) parser, where the parser stack is implicitly constructed during the execution of the parser using the combination of the current continuation and the stack of activation records.

Why Can't C++ Be Parsed With a Lr(1) Parser