C++ Templates Angle Brackets Pitfall - What Is the C++11 Fix

It's fixed by adding a special case to the rules used when parsing template arguments.

C++11 14.2/3: When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id.
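
In practice, this means nested template-ids no longer need a space between the closing brackets. A minimal illustration:

#include <vector>

std::vector<std::vector<int>> v;   // OK in C++11; a syntax error in C++03
std::vector<std::vector<int> > u;  // the spelling C++03 required (still valid)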

What are all the syntax problems introduced by the usage of angle brackets in C++ templates?

But haven't compilers become able to parse that syntax nowadays?

Of course. But it's far from trivial. In particular, it prevents a clean separation between a context-unaware lexer and the parser. This is particularly irksome for syntax highlighters and other support tools that need to parse C++ but don't want to, or can't, implement a fully-fledged syntactic analyser.
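
For instance, the very same >> character sequence must be interpreted differently depending on context, which a lexer alone cannot decide (a minimal made-up example):

template <typename T> struct A {};
template <typename T> struct B {};

A<B<int>> a;        // here ">>" closes two template-argument-lists
int x = 256 >> 4;   // here ">>" is the right-shift operator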

It makes C++ so much harder to parse that a lot of tools simply won’t bother. This is a net loss for the ecosystem. Put differently: it makes developing a parsing tool much more expensive.

For instance, ctags fails for some template definitions, which makes it unusable with our current C++ project. Very annoying.

But I don't think there are that many cases where you need to [distinguish between angle brackets and less-than].

It doesn’t matter how often you need to do this. Your parser still needs to handle this.

D’s decision to drop angle brackets was a no-brainer. Any one reason would have sufficed, given that it’s a net benefit.

Why does an alias template give a conflicting declaration?

The problem revolves around SFINAE. If you rewrite your member declaration to use value_t<S<T>>, like the out-of-class declaration, then GCC will happily compile it:

// Assumed from the original question: an alias template yielding T::value_type.
template<class T>
using value_t = typename T::value_type;

template<class T>
struct S
{
    using value_type = int;
    static const value_t<S<T>> C = 0;
};

template<class T>
const value_t<S<T>> S<T>::C;

Because the expression is now functionally equivalent. Things like substitution failure come into play with alias templates, and as you can see, the member declaration value_type const C doesn't have the same "prototype" as value_t<S<T>> const S<T>::C. The first one doesn't have to perform SFINAE, whereas the second one requires it. So the two declarations have different functionality, hence GCC's tantrum.

Interestingly, Clang compiles it without a sign of abnormality. I assume it just so happens that the order of Clang's analyses is reversed compared to GCC's. Once the alias-template expression is resolved and fine (i.e. it is well-formed), Clang then compares both declarations and checks whether they are equivalent (which in this case they are, given that both expressions resolve to value_type).

Now, which one is correct in the standard's eyes? It's still an unresolved issue whether to consider an alias template's SFINAE as part of its declaration's functionality. Quoting [temp.alias]/2:

When a template-id refers to the specialization of an alias template, it is equivalent to the associated type obtained by substitution of its template-arguments for the template-parameters in the type-id of the alias template.

In other words, these two are equivalent:

template<class T>
struct Alloc { /* ... */ };

template<class T>
using Vec = vector<T, Alloc<T>>;

Vec<int> v;
vector<int, Alloc<int>> u;

Vec<int> and vector<int, Alloc<int>> are equivalent types, because after substitution is performed, both types end up being vector<int, Alloc<int>>. Note how "after substitution" means that the equivalence is only checked once all template parameters are replaced with the template arguments. That is, comparison starts once T in vector<T, Alloc<T>> is replaced with int from Vec<int>. Maybe that's what Clang is doing with value_t<S<T>>? But then there's the following quote from [temp.alias]/3:

However, if the template-id is dependent, subsequent template argument substitution still applies to the template-id. [Example:

template<typename...> using void_t = void;
template<typename T> void_t<typename T::foo> f();
f<int>(); // error, int does not have a nested type foo

 — end example]

Here's the problem: the expression has to be well-formed, so the compiler needs to check whether the substitution is fine. When the substitution depends on a template parameter (e.g. typename T::foo), the functionality of the whole expression changes, and the definition of "equivalence" differs. For example, the following code won't compile (on either GCC or Clang):

#include <type_traits>

struct X
{
    template <typename T>
    auto foo(T) -> std::enable_if_t<sizeof(T) == 4>;
};

template <typename T>
auto X::foo(T) -> void
{}

Because the out-of-class foo's prototype is functionally different from the in-class one. Writing auto X::foo(T) -> std::enable_if_t<sizeof(T) == 4> instead makes the code compile fine. That's because the return type of foo depends on the result of sizeof(T) == 4, so after template substitution its prototype may differ from one instantiation to the next. By contrast, auto X::foo(T) -> void's return type never differs, which conflicts with the declaration inside X. This is the very same issue that's happening with your code. So GCC seems to be correct in this case.

Auto formatter changes "> >" to ">>"

The VSCode C++ extension uses clang-format to format the document. If you are stuck with an old compiler that doesn't support C++11, just add a .clang-format file to your workspace with the following line:

Standard: Cpp03

For more formatting options, refer to the following link:
https://clang.llvm.org/docs/ClangFormatStyleOptions.html
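
For example, a minimal .clang-format might look like this (the BasedOnStyle value is just an illustrative choice):

# Keep C++03-compatible "> >" in nested template-argument-lists
BasedOnStyle: LLVM
Standard: Cpp03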

Differentiating between ">" and ">>" when parsing generic types

When parsing a language, there are typically two main components: the scanner and the parser. The scanner produces a stream of tokens, and the parser interprets that stream based on a grammar, which is a formal definition of the production rules in the language - you can find the grammar for C# 4.0 here.

Disclaimer: I am not implying the following is necessarily how the C# language is parsed, I am merely using a C# snippet to illustrate the general concepts.

Scanning

So the first step is to produce the tokens for the parser. The tokens will generally be made up of some kind of symbolic type (indicating the kind of token it is), the lexeme (the actual text of the token), and perhaps other information such as the line number (useful for error handling).

So if we use List<Nullable<int>> list; from your question as an example, the scanner would produce the following tokens:

available_identifier, List
<
available_identifier, Nullable
<
integral_type, int
>
>
available_identifier, list
;

Note that the token types are inferred from the C# grammar linked to above.
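
As a rough sketch (in C++, with illustrative names), such a token record might look like this:

#include <string>

enum class TokenKind { AvailableIdentifier, IntegralType, LessThan,
                       GreaterThan, Comma, Semicolon };

struct Token {
    TokenKind   kind;    // symbolic type of the token
    std::string lexeme;  // the actual text
    int         line;    // for error reporting
};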

Parsing

Most parsers are what's known as shift-reduce parsers. This means that tokens are shifted onto a stack incrementally, and are reduced (removed) when they match a rule. To help match, a parser will have a certain number of lookahead tokens it may observe (I believe one is most common). In general, a successful parse will conclude when all tokens have been reduced.

The type of parser that is implemented by compiler construction programs such as YACC and GPPG is known as an LALR(1) parser. These work by building up a parse table based on every legal combination of parser state and lookahead symbol, and given the current state and next symbol, can then tell us how to compute the next state.

Now that we have our tokens, we fire them into the parser, the outcome of which will usually be an abstract syntax tree, which can be used for later tasks such as code generation, semantic type checking etc. To parse those tokens, we need rules to group them into meaningful syntactic units - this is what prevents confusion when encountering >>.

From the C# grammar:

declaration_statement:
| local_variable_declaration ";"
| local_constant_declaration ";"

local_variable_declaration:
| local_variable_type local_variable_declarators

local_variable_type:
| type
| "var"

local_variable_declarators:
| local_variable_declarator
| local_variable_declarators "," local_variable_declarator

local_variable_declarator:
| identifier
| identifier "=" local_variable_initializer

type:
| value_type
| reference_type
| type_parameter
| type_unsafe

value_type:
| struct_type
| enum_type

struct_type:
| type_name
| simple_type
| nullable_type

simple_type:
| numeric_type
| bool

numeric_type:
| integral_type
| floating_point_type
| decimal

integral_type:
| "sbyte"
| "byte"
| "short"
| "ushort"
| "int"
| "uint"
| "long"
| "ulong"
| "char"

reference_type:
| class_type
| interface_type
| array_type
| delegate_type

class_type:
| type_name
| "object"
| "dynamic"
| "string"

type_name:
| namespace_or_type_name

namespace_or_type_name:
| identifier type_argument_list?
| namespace_or_type_name "." identifier type_argument_list?
| qualified_alias_member

identifier:
| available_identifier
| "@" identifier_or_keyword

type_argument_list:
| "<" type_arguments ">"

type_arguments:
| type_argument
| type_arguments "," type_argument

type_argument:
| type

Looks complicated, but stay with me. Each rule is of the form

rule_name:
| production_1
| production_2
| production_3

Each production can be another rule (a non-terminal) or a terminal. Take the integral_type rule for example: all of its productions are terminals. Rules can also refer to themselves, which is how things like the type arguments in Tuple<int, int, double> are dealt with.

For the purposes of this example, I'll assume that List<Nullable<int>> list; is a local variable declaration. You can find a simpler example on the Shift-Reduce Parsing Wikipedia page, and another on the LR Parsing page.

To begin with, our Parse Stack is empty, our single lookahead token is the very first one and our first action will be to shift that token. That is, our parser-state will look like this:

Step 0
Parse Stack: empty
Look Ahead: available_identifier
Unscanned: List<Nullable<int>> list;
Parser Action: Shift

In our next step, we can reduce the current token based on the production identifier <- available_identifier.

Step 1
Parse Stack: available_identifier
Look Ahead: "<"
Unscanned: <Nullable<int>> list;
Parser Action: Reduce by identifier <- available_identifier

Skipping ahead a few steps, at step 10 we will have the following parser-state:

Step 10
Parse Stack: identifier "<" identifier "<" type_arguments ">"
Look Ahead: ">"
Unscanned: > list;
Parser Action: Reduce by type_argument_list <- "<" type_arguments ">"

At this point we will be able to reduce the last three tokens as their sequence makes up a type_argument_list (you can check this in the rules above). Fast forward a little more to step 13 and we have the following:

Step 13
Parse Stack: identifier "<" type_arguments ">"
Look Ahead: ">"
Unscanned: list;
Parser Action: Reduce by type_argument_list <- "<" type_arguments ">"

Just like in step 10, we are reducing by type_argument_list <- "<" type_arguments ">". In doing so, we have in fact avoided any ambiguity with >>. These steps continue until we reduce by declaration_statement <- local_variable_declaration ";" - the first rule above.

Summary

By creating an unambiguous grammar, parsers are able to easily disambiguate seemingly tricky situations like List<Nullable<int>>. What I've covered here is essentially a bottom-up, LALR(1) parser. I haven't gone into the actual creation of the abstract syntax tree, but you probably have enough on your plate with the above.

Keep in mind that the rules did not include a start-state - this was mainly for the sake of brevity. If it's helpful, I can throw the rest of the parse steps in there.

Edit: f(g<a, b>(c))

What this boils down to in the grammar is two applications of the invocation_expression rule, which is of the form invocation_expression -> primary_expression "(" argument_list? ")"

The first one matches g<a, b>(c). It does so by first establishing that g<a, b> is an identifier followed by a type_argument_list. Our lookahead is now "(", and because the parser will know from previous context that this code is in a method body, it can reduce identifier type_argument_list by

primary_expression <- primary_no_array_creation_expression
<- simple_name <- identifier type_argument_list?

After shifting "(" and c, we can reduce c by

argument_list <- argument <- argument_value <- expression
<- <a really long list of rules> <- simple_name
<- identifier <- available_identifier

And shifting that final parenthesis gives us

primary_expression ( argument_list? )

Which can then be reduced by the invocation_expression rule, thus matching g<a, b>(c).

By this point we would have already matched f as an identifier and applied the reduction

primary_expression <- primary_no_array_creation_expression
<- simple_name <- identifier type_argument_list?

So the parse stack will therefore contain the following

primary_expression "(" invocation_expression
^ ^ ^
f ( g<a, b>(c)

The lookahead symbol will be the very last ")", so the parser will reduce invocation_expression by

argument_list <- argument <- argument_value <- expression
<- <the same really long list of rules> <- primary_expression
<- primary_no_array_creation_expression <- invocation_expression

Shifting that last ")" will then give us

    primary_expression "(" argument_list ")"
^ ^ ^ ^
f ( g<a, b>(c) )

Like before, this can be reduced by the invocation_expression rule, thus matching f(g<a, b>(c)).

What are the advantages of list initialization (using curly braces)?

Basically copying and pasting from Bjarne Stroustrup's "The C++ Programming Language 4th Edition":

List initialization does not allow narrowing (§iso.8.5.4). That is:

  • An integer cannot be converted to another integer that cannot hold its value. For example, char to int is allowed, but not int to char.
  • A floating-point value cannot be converted to another floating-point type that cannot hold its value. For example, float to double is allowed, but not double to float.
  • A floating-point value cannot be converted to an integer type.
  • An integer value cannot be converted to a floating-point type.

Example:

void fun(double val, int val2)
{
    int x2 = val;    // if val == 7.9, x2 becomes 7 (bad)
    char c2 = val2;  // if val2 == 1025, c2 becomes 1 (bad)

    int x3 {val};    // error: possible truncation (good)
    char c3 {val2};  // error: possible narrowing (good)

    char c4 {24};    // OK: 24 can be represented exactly as a char (good)
    char c5 {264};   // error (assuming 8-bit chars): 264 cannot be
                     // represented as a char (good)

    int x4 {2.0};    // error: no double to int value conversion (good)
}

The only situation where = is preferred over {} is when using the auto keyword to get the type determined from the initializer.

Example:

auto z1 {99};   // z1 is an int
auto z2 = {99}; // z2 is a std::initializer_list<int>
auto z3 = 99;   // z3 is an int


Conclusion

Prefer {} initialization over alternatives unless you have a strong reason not to.

Compiler: limitation of lexical analysis

The C++ standard requires that an implementation perform lexical analysis to produce a stream of tokens, before the parsing stage. According to the lexical analysis rules, two consecutive > characters (not followed by =) will always be interpreted as one >> token. The grammar provided with the C++ standard is defined in terms of these tokens.
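
For example, under this "maximal munch" rule the following line lexes as a single >> token, which is exactly why C++03 required the space in "> >":

#include <map>
#include <vector>

std::map<int, std::vector<int>> m;  // one ">>" token; ill-formed in C++03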

The requirement that in certain contexts (such as when expecting a > within a template-id) the implementation should interpret >> as two > is not specified within the grammar. Instead the rule is specified as a special case:

14.2 Names of template specializations [temp.names]


After name lookup (3.4) finds that a name is a template-name or that an operator-function-id or a literal-operator-id refers to a set of overloaded functions any member of which is a function template, if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id. [Note: The second > token produced by this replacement rule may terminate an enclosing template-id construct or it may be part of a different construct (e.g. a cast). — end note]

Note the earlier rule, that in certain contexts < should be interpreted as the < in a template-argument-list. This is another example of a construct that requires context in order to disambiguate the parse.

The C++ grammar contains many such ambiguities which cannot be resolved during parsing without information about the context. The most notorious of these is the Most Vexing Parse, in which an identifier may be interpreted as a type-name depending on context.
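
A minimal made-up example of the Most Vexing Parse:

struct Gadget {};
struct Widget { Widget(Gadget) {} };

// Declares a function w returning Widget and taking an unnamed parameter
// of type "function returning Gadget" - not a Widget variable!
Widget w(Gadget());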

Keeping track of the aforementioned context in C++ requires an implementation to perform some semantic analysis in parallel with the parsing stage. This is commonly implemented in the form of semantic actions that are invoked when a particular grammatical construct is recognised in a given context. These semantic actions then build a data structure that represents the context and permits efficient queries. This is often referred to as a symbol table, but the structure required for C++ is pretty much the entire AST.

This kind of context-sensitive semantic action can also be used to resolve ambiguities. For example, on recognising an identifier in the context of a namespace-body, a semantic action will check whether the name was previously defined as a template. The result of this check is then fed back to the parser. This can be done by marking the identifier token with the result, or by replacing it with a special token that will match a different grammar rule.

The same technique can be used to mark a < as the beginning of a template-argument-list, or a > as the end. The rule for context-sensitive replacement of >> with two > poses essentially the same problem and can be resolved using the same method.
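
As a rough sketch (hypothetical names, not any particular compiler's implementation), the feedback loop might look like this:

#include <string>
#include <unordered_set>
#include <vector>

struct Token { std::string text; };

// Populated by semantic actions as declarations are recognised.
std::unordered_set<std::string> template_names;

bool is_template_name(const Token& t) {
    return template_names.count(t.text) > 0;
}

// Invoked when the parser is inside a template-argument-list: a ">>"
// token from the lexer is replaced with two distinct ">" tokens.
std::vector<Token> split_right_shift(const Token& t) {
    if (t.text == ">>")
        return {Token{">"}, Token{">"}};
    return {t};
}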


