Why Are There Digraphs in C and C++

Why are there digraphs in C and C++?

Digraphs were created for programmers whose keyboards supported only the invariant part of the ISO 646 character set, which lacks some of the punctuation that C uses.

http://en.wikipedia.org/wiki/C_trigraph

Purpose of Trigraph sequences in C++?

This question (about the closely related digraphs) has the answer.

It boils down to the fact that the ISO 646 character set doesn't have all the characters of the C syntax, so there are some systems with keyboards and displays that can't deal with the characters (though I imagine that these are quite rare nowadays).

In general, you don't need to use them, but you need to know about them for exactly the problem you ran into. Trigraphs are the reason the '?' character has an escape sequence:

'\?'

So a couple ways you can avoid your example problem are:

 printf( "What?\?!\n" ); 

printf( "What?" "?!\n" );

But you have to remember when you're typing the two '?' characters that you might be starting a trigraph (and it's certainly never something I'm thinking about).

In practice, trigraphs and digraphs are something I don't worry about at all on a day-to-day basis. But you should be aware of them because once every couple of years you'll run into a bug related to them (and you'll spend the rest of the day cursing their existence). It would be nice if compilers could be configured to warn (or error) when they come across a trigraph or digraph, so I could know I've got something I should knowingly deal with.

And just for completeness, digraphs are much less dangerous since they get processed as tokens, so a digraph inside a string literal won't get interpreted as a digraph.

For a nice education on various fun with punctuation in C/C++ programs (including a trigraph bug that would definitely have me pulling my hair out), take a look at Herb Sutter's GOTW #86 article.


Addendum:

It looks like GCC will not process (and will warn about) trigraphs by default. Some other compilers have options to turn off trigraph support (IBM's for example). Microsoft started supporting a warning (C4837) in VS2008 that must be explicitly enabled (using -Wall or something).

Are digraphs transformed by a compiler and trigraphs transformed by a preprocessor?

Trigraph sequences are indeed replaced with the corresponding character at the first phase of the compiling process, before the preprocessor lexer analyses the stream of characters to produce preprocessor tokens.

The very next phase handles escaped newlines, i.e. instances of \ immediately followed by a newline, which are removed from the character stream. Note that the \ can be produced by the first phase as a replacement for the ??/ trigraph.

The lexer then analyses the character stream to produce preprocessing tokens such as [ and <:, which are alternate spellings for the same token, just like 1e1 and 1E1. Hence <: is not replaced with [; it is a different sequence of characters producing the same token.

Trigraphs cannot be produced by token pasting using the ## preprocessor operator in macro expansions, but digraphs can.

Here is a small sample program to illustrate this process, including the special handling of the ??/ trigraph, which expands to \ and thus can be used in the middle of a digraph split across 2 lines:

#include <stdio.h>

#define STR(x) #x
#define xSTR(x) STR(x)
#define glue(a,b) a##b

int main() {
    puts(STR(??!));
    puts(STR('??!'));
    puts(STR("??!"));

    puts(STR(<:));
    puts(STR('<:'));
    puts(STR("<:"));

    puts(STR(<\
:));
    puts(STR(<??/
:));
    puts(STR('<\
:'));
    puts(STR("<\
:"));

    puts(STR(glue(<,:)));
    puts(xSTR(glue(<,:)));
    return 0;
}

Output:

chqrlie $ make lexing && ./lexing
clang -O3 -funsigned-char -std=c11 -Weverything -Wwrite-strings -lm -o lexing lexing.c
lexing.c:8:14: warning: trigraph converted to '|' character [-Wtrigraphs]
puts(STR(??!));
^
lexing.c:9:15: warning: trigraph converted to '|' character [-Wtrigraphs]
puts(STR('??!'));
^
lexing.c:10:15: warning: trigraph converted to '|' character [-Wtrigraphs]
puts(STR("??!"));
^
lexing.c:18:15: warning: trigraph converted to '\' character [-Wtrigraphs]
puts(STR(<??/
^
4 warnings generated.
|
'|'
"|"
<:
'<:'
"<:"
<:
<:
'<:'
"<:"
glue(<,:)
<:

Are digraphs and trigraphs in use today?

There is a proposal pending for C++1z (the standard after C++1y, which will hopefully be standardized as C++14) that aims to remove trigraphs from the Standard. Its authors did a case study on an otherwise undisclosed large codebase:

Case study

The uses of trigraph-like constructs in one large codebase were
examined. We discovered:

923 instances of an escaped ? in a string literal to avoid trigraph
replacement: string pattern() const { return "foo-????\?-of-?????"; }

4 instances of trigraphs being used deliberately in test code: two in
the test suite for a compiler, the other two in a test suite for
boost's preprocessor library.

0 instances of trigraphs being deliberately used in production code.

Trigraphs continue to pose a burden on users of C++.

The proposal notes (bold emphasis from the original proposal):

If trigraphs are removed from the language entirely, an
implementation that wishes to support them can continue to do so: its
implementation-defined mapping from physical source file characters to
the basic source character set can include trigraph translation (and
can even avoid doing so within raw string literals). We do not need
trigraphs in the standard for backwards compatibility.

Digraph and trigraph can't work together?

Digraphs and trigraphs are totally different. Trigraphs are replaced during phase 1 of translation, [see Note 1] which is before the source code has been separated into tokens. Digraphs are tokens which are alternate spellings for other tokens, so they are not meaningful until after the source has been separated into tokens. (The word "digraph" is not very accurate; it is used because it resembles "trigraph", but the set of digraphs includes %:%: which consists of four characters.)

So ??= is replaced with a # before any token analysis is done. But %: is just a token, with the same meaning as #.

Moreover, %:%: is a token with the same meaning as ##. But %:# is two tokens (%: and #), which is not legal since the stringify operator (whether spelled %: or #) can only be followed by a macro parameter. [See Note 2] And it does not become any less illegal if the # were the result of a trigraph substitution.

One important difference between digraphs and trigraphs, as illustrated by the hilarious snippet in chqrlie's answer, is that trigraphs also work in strings. Digraphs allow you to write C code even if your keyboard lacks brackets and octothorpi, but they don't help you print those characters out.


Notes (Standards quotes):

  1. §5.1.1.2, Translation phases, paragraph 1:

    The precedence among the syntax rules of translation is specified by the following phases.

    1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
  2. §6.10.3.2, The # operator, paragraph 1:

    Each # preprocessing token in the replacement list for a function-like macro shall be
    followed by a parameter as the next preprocessing token in the replacement list.

C++17 why not remove digraphs along with trigraphs?

Trigraphs are more problematic to the unaware user than digraphs. This is because they are replaced within string literals and comments. Here are some examples…

Example A:

std::string example = "What??!??!";
std::cout << example << std::endl;

What|| will be printed to the console. This is because of the trigraph ??! being translated to |.

Example B:

// Error ?!?!?!??!??/
std::cout << "There was an error!" << std::endl;

Nothing will happen at all. This is because ??/ translates to \, which escapes the newline character and results in the next line being commented out.

Example C:

// This makes no sense ?!?!!?!??!??/
std::string example = "Hello World";
std::cout << example << std::endl;

This will give an error along the lines of use of undeclared identifier "example" for the same reasons as Example B.

There are far more elaborate problems trigraphs can cause too, but you get the idea. It's worth noting that many compilers actually emit a warning when such translations are being made; yet another reason to always treat warnings as errors. However this is not required by the standard and therefore cannot be relied upon.

Digraphs are much less problematic than trigraphs, as they are not replaced inside another token (i.e. a string or character literal) and there is not a sequence that translates to \, so escaping new lines in comments cannot occur.

Conclusion

Other than making code harder to read, digraphs cause fewer problems than trigraphs, so the need to remove them is greatly reduced.

Why does GCC emit a warning when using trigraphs, but not when using digraphs?

This gcc document on pre-processing gives a pretty good rationale for a warning (emphasis mine):

Trigraphs are not popular and many compilers implement them incorrectly. Portable code should not rely on trigraphs being either converted or ignored. With -Wtrigraphs GCC will warn you when a trigraph may change the meaning of your program if it were converted.

and this gcc document on Tokenization explains that digraphs, unlike trigraphs, do not have negative side effects (emphasis mine):

There are also six digraphs, which the C++ standard calls alternative tokens, which are merely alternate ways to spell other punctuators. This is a second attempt to work around missing punctuation in obsolete systems. It has no negative side effects, unlike trigraphs,

Why are there alternative tokens in C++?

They're a hangover from C, really. There were implementations of C in which not all characters were available (such as some variants of EBCDIC that have no square brackets).

The C99 rationale document, section 5.2.1.1 Trigraph sequences has this to say:

Trigraph sequences were introduced in C89 as alternate spellings of some characters to allow the implementation of C in character sets which do not provide a sufficient number of non-alphabetic graphics.

The characters in the ASCII repertoire used by C and absent from the ISO/IEC 646 invariant repertoire are #, [, ], {, }, \, |, ~, and ^

Are trigraphs still valid C++?

Trigraphs are currently valid, but won't be for long!

Trigraphs were proposed for deprecation in C++0x, which was released
as C++11. This was opposed by IBM, speaking on behalf of itself and
other users of C++, and as a result trigraphs were retained in
C++0x. Trigraphs were then proposed again for removal (not only
deprecation) in C++17. This passed a committee vote, and trigraphs
are expected to be removed from C++17 despite the opposition from IBM
and others. Existing code that uses trigraphs can be supported by
translating from the physical source files (parsing trigraphs) to the
basic source character set that does not include trigraphs. [Wikipedia]

Digraphs, however, are sticking around for now.

Preprocessing C99 digraphs away

As far as I know, there is no standard tool which does this transformation. In particular, the preprocessor does not substitute digraphs, because (unlike trigraphs) digraphs are just ordinary tokens which happen to mean the same thing as other ordinary tokens.

It would be relatively straightforward to write such a processor using flex, starting with an existing flex definition for C.


