Is C++ Context-Free or Context-Sensitive

Is C++ context-free or context-sensitive?

Below is my (current) favorite demonstration of why parsing C++ is (probably) Turing-complete, since it shows a program which is syntactically correct if and only if a given integer is prime.

So I assert that C++ is neither context-free nor context-sensitive.

If you allow arbitrary symbol sequences on both sides of any production, you produce an Type-0 grammar ("unrestricted") in the Chomsky hierarchy, which is more powerful than a context-sensitive grammar; unrestricted grammars are Turing-complete. A context-sensitive (Type-1) grammar allows multiple symbols of context on the left hand side of a production, but the same context must appear on the right hand side of the production (hence the name "context-sensitive"). [1] Context-sensitive grammars are equivalent to linear-bounded Turing machines.

In the example program, the prime computation could be performed by a linear-bounded Turing machine, so it does not quite prove Turing equivalence, but the important part is that the parser needs to perform the computation in order to perform syntactic analysis. It could have been any computation expressible as a template instantiation and there is every reason to believe that C++ template instantiation is Turing-complete. See, for example, Todd L. Veldhuizen's 2003 paper.

Regardless, C++ can be parsed by a computer, so it could certainly be parsed by a Turing machine. Consequently, an unrestricted grammar could recognize it. Actually writing such a grammar would be impractical, which is why the standard doesn't try to do so. (See below.)

The issue with "ambiguity" of certain expressions is mostly a red herring. To start with, ambiguity is a feature of a particular grammar, not a language. Even if a language can be proven to have no unambiguous grammars, if it can be recognized by a context-free grammar, it's context-free. Similarly, if it cannot be recognized by a context-free grammar but it can be recognized by a context-sensitive grammar, it's context-sensitive. Ambiguity is not relevant.

But in any event, like line 21 (i.e. auto b = foo<IsPrime<234799>>::typen<1>();) in the program below, the expressions are not ambiguous at all; they are simply parsed differently depending on context. In the simplest expression of the issue, the syntactic category of certain identifiers is dependent on how they have been declared (types and functions, for example), which means that the formal language would have to recognize the fact that two arbitrary-length strings in the same program are identical (declaration and use). This can be modelled by the "copy" grammar, which is the grammar which recognizes two consecutive exact copies of the same word. It's easy to prove with the pumping lemma that this language is not context-free. A context-sensitive grammar for this language is possible, and a Type-0 grammar is provided in the answer to this question: https://math.stackexchange.com/questions/163830/context-sensitive-grammar-for-the-copy-language .

If one were to attempt to write a context-sensitive (or unrestricted) grammar to parse C++, it would quite possibly fill the universe with scribblings. Writing a Turing machine to parse C++ would be an equally impossible undertaking. Even writing a C++ program is difficult, and as far as I know none have been proven correct. This is why the standard does not attempt to provide a complete formal grammar, and why it chooses to write some of the parsing rules in technical English.

What looks like a formal grammar in the C++ standard is not the complete formal definition of the syntax of the C++ language. It's not even the complete formal definition of the language after preprocessing, which might be easier to formalize. (That wouldn't be the language, though: the C++ language as defined by the standard includes the preprocessor, and the operation of the preprocessor is described algorithmically since it would be extremely hard to describe in any grammatical formalism. It is in that section of the standard where lexical decomposition is described, including the rules where it must be applied more than once.)

The various grammars (two overlapping grammars for lexical analysis, one which takes place before preprocessing and the other, if necessary, afterwards, plus the "syntactic" grammar) are collected in Appendix A, with this important note (emphasis added):

This summary of C++ syntax is intended to be an aid to comprehension. It is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (6.8, 7.1, 10.2) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs.

Finally, here's the promised program. Line 21 is syntactically correct if and only if the N in IsPrime<N> is prime. Otherwise, typen is an integer, not a template, so typen<1>() is parsed as (typen<1)>() which is syntactically incorrect because () is not a syntactically valid expression.

template<bool V> struct answer { answer(int) {} bool operator()(){return V;}};

template<bool no, bool yes, int f, int p> struct IsPrimeHelper
: IsPrimeHelper<p % f == 0, f * f >= p, f + 2, p> {};
template<bool yes, int f, int p> struct IsPrimeHelper<true, yes, f, p> { using type = answer<false>; };
template<int f, int p> struct IsPrimeHelper<false, true, f, p> { using type = answer<true>; };

template<int I> using IsPrime = typename IsPrimeHelper<!(I&1), false, 3, I>::type;
template<int I>
struct X { static const int i = I; int a[i]; };

template<typename A> struct foo;
template<>struct foo<answer<true>>{
template<int I> using typen = X<I>;
};
template<> struct foo<answer<false>>{
static const int typen = 0;
};

int main() {
auto b = foo<IsPrime<234799>>::typen<1>(); // Syntax error if not prime
return 0;
}

[1] To put it more technically, every production in a context-sensitive grammar must be of the form:

αAβ → αγβ

where A is a non-terminal and α, β are possibly empty sequences of grammar symbols, and γ is a non-empty sequence. (Grammar symbols may be either terminals or non-terminals).

This can be read as A → γ only in the context [α, β]. In a context-free (Type 2) grammar, α and β must be empty.

It turns out that you can also restrict grammars with the "monotonic" restriction, where every production must be of the form:

α → β where |α| ≥ |β| > 0  (|α| means "the length of α")

It's possible to prove that the set of languages recognized by monotonic grammars is exactly the same as the set of languages recognized by context-sensitive grammars, and it's often the case that it's easier to base proofs on monotonic grammars. Consequently, it's pretty common to see "context-sensitive" used as though it meant "monotonic".

How to make C language context-free?

Yes. C can be parsed with a classical lex + yacc combo. The lexer definition and the yacc grammar are freely available at

http://www.quut.com/c/ANSI-C-grammar-l-2011.html
and
http://www.quut.com/c/ANSI-C-grammar-y-2011.html

As you can see from the lex file, it's straightforward except for the context-sensitive check_type() (and comment(), but comment processing technically belongs to the preprocessor), which makes typedef the only source of context-sensitivity there. Since the yacc file doesn't contain any context-sensitivity introducing tricks either, a typedef-less C would be a perfectly context-free language.

What programming languages are context-free?

The set of programs that are syntactically correct is context-free for almost all languages.

The set of programs that compile is not context-free for almost all languages. For example, if the set of all compiling C programs were context free, then by intersecting with a regular language (also known as a regex), the set of all compiling C programs that match

^int main\(void\) { int a+; a+ = a+; return 0; }$

would be context-free, but this is clearly isomorphic to the language a^kba^kba^k, which is well-known not to be context-free.

Is C# considered a context free language?

I would describe C# as having a context-free grammar, but the language has context-sensitive rules that are not expressed in the grammar.

From Wikipedia (Formal grammar):

A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol.

From the C# 4.0 specification, section 2.2.1 (Grammar notation):

The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols.

As I read it, that means that the production rules defining the C# language are context-free. The left-hand side of each production rule is a single non-terminal symbol. By contrast, a context-sensitive grammar may have multiple terminal and non-terminal symbols on the left-hand side of a production rule.

There are, however, many rules in the specification that are context-sensitive. For example, "A local variable must be definitely assigned (§5.3) at each location where its value is obtained." Also, "The signature of a method must be unique in the class in which the method is declared." These are both cases where the validity of any particular fragment depends on the context in which it appears.

Of course, many programming languages have similar requirements, including C. I doubt most would consider C a context-sensitive language. I think this answer summarized it well:

The set of programs that are syntactically correct is context-free for almost all languages. The set of programs that compile is not context-free for almost all languages.

As for Python and F#, as I said in my comment, those languages are usually described as having semantic (or sometimes syntactic) whitespace not context-sensitive.

Are the grammars of modern programming languages context-free or context-sensitive?

C++ is neither context-free nor context-sensitive, since the template system is Turing-complete and determining whether a piece of C++ code is legal C++ is undecidably hard. For example, I could define a template class that simulates a TM on a string and then creates a constant with value 1 if the machine accepts and 0 if it does not. If I did that, then the following code would be legal iff the TM halted on the given input:

int myArray[TMTemplate</* ... args ... */>::value];

Since if the TM rejects, this creates an array of size 0, which is not allowed.

Neither C# nor Java is context-free, because checking of whether a variable is used correctly and consistently throughout a particular scope is known not to be context-free (the proof is complex and relies on Ogden's lemma). However, I'm not sure whether or not they are context-sensitive.

Hope this gives a partial answer to your questions!

What does context mean by context-free and context-sensitive grammars?

The direction of the arrow is important, so if I can't talk about it, it's going to be difficult to explain. So, sorry, I'm going to use arrows. They really aren't complicated.

The expression A -> ... means "an A is ...". It does not mean "... is an A". Context-free means that if A can be "..." in some context, it can be "..." in any context. But the arrow always points from category to specific; never backwards.

In your example, an identifier is a letter followed by a bunch of alphanumeric symbols:

 identifier -> letter (letter OR digit)...

So identifier could be var. That doesn't mean that var is always an identifier, as your example shows. The arrow points in one direction.

Because the grammar is context-free, if when we are looking for an identifier in some context and we accept var as an identifier, then in any other context where we are looking for an identifier, we must also accept var.

But there are contexts (between quotes) where we are not looking for an identifier. That's fine; the context-free condition has not been broken. The context applies in the direction of the arrow.

Which languages are context-sensitive?

Context-sensitive grammar has symbols that change their meaning in relation with various nonterminals they are used in their context. In computer world, they are pretty rare, because it quite complicates writing of the parser - decision if a string belongs to a certain context-sensitive grammar is PSPACE-complete.

Is Rust's syntactical grammar context-free or context-sensitive?

Rust includes a macro processor, whose operation is highly context-sensitive.

You could attempt to skate around this issue by only doing syntactic analysis up to but not including macro expansion -- possible, but not particularly useful -- or by assuming that the macro expansion is done by some intermediate tool which is given a free pass to allow it to be Turing complete.

But I'm inclined to say that it simply means that the Rust language is recursively enumerable.

There are a number of restrictions on the validity of macro definitions which probably make the language (at least) context-sensitive, even if you settle for not performing the macro expansions as part of syntactic analysis.

This doesn't mean that a context-free grammar cannot be useful as part of the syntactic analysis of Rust. It's probably essential, and it could even be useful to use a parser generator such as bison or Antlr (and examples of both exist). Like most programming languages, there is a simple superset of Rust which is context-free, and which can be usefully analysed with context-free grammar tools; however, in the end there are texts which will need to be rejected at compile-time as invalid even though they are part of the CF superset.



Related Topics



Leave a reply



Submit