Purpose of Trigraph Sequences in C++

Purpose of Trigraph sequences in C++?

This question (about the closely related digraphs) has the answer.

It boils down to the fact that the ISO 646 character set doesn't have all the characters of the C syntax, so there are some systems with keyboards and displays that can't deal with the characters (though I imagine that these are quite rare nowadays).

In general, you don't need to use them, but you need to know about them for exactly the problem you ran into. Trigraphs are the reason the '?' character has an escape sequence:

'\?'

So a couple ways you can avoid your example problem are:

printf( "What?\?!\n" );

printf( "What?" "?!\n" );

But you have to remember when you're typing the two '?' characters that you might be starting a trigraph (and it's certainly never something I'm thinking about).
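To make the two workarounds above concrete, here is a minimal sketch showing that both spellings denote the same text. The names escaped, spliced and same_text are made up for this illustration.

```c
#include <string.h>

/* Two spellings that avoid forming a trigraph in the source text. */
static const char escaped[] = "What?\?!\n";    /* \? is just an ordinary '?' */
static const char spliced[] = "What?" "?!\n";  /* adjacent literals are joined later */

/* Returns 1 when both spellings produced identical strings. */
int same_text(void) {
    return strcmp(escaped, spliced) == 0;
}
```

Either way the program ends up with "What", two question marks, "!" and a newline, whether or not the compiler replaces trigraphs.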

In practice, trigraphs and digraphs are something I don't worry about at all on a day-to-day basis. But you should be aware of them because once every couple years you'll run into a bug related to them (and you'll spend the rest of the day cursing their existence). It would be nice if compilers could be configured to warn (or error) when they come across a trigraph or digraph, so I could know I've got something I should knowingly deal with.

And just for completeness, digraphs are much less dangerous since they get processed as tokens, so a digraph inside a string literal won't get interpreted as a digraph.
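As a small illustration of that difference, the sketch below spells braces and brackets with digraphs and also puts a digraph-looking pair of characters inside a string literal. The function names are hypothetical, and it assumes a compiler that accepts the C95 digraphs (GCC and Clang do by default as far as I know).

```c
#include <string.h>

/* Digraph spellings of { } [ ] are tokens, so this reads as ordinary C. */
int third_element(void) <%
    int values<:3:> = <% 10, 20, 30 %>;
    return values<:2:>;
%>

/* Inside a string literal the same two characters stay exactly as typed. */
size_t literal_length(void) <%
    return strlen("<:");
%>
```

The string keeps its two characters because digraphs are only recognized as whole tokens, never inside another token such as a string literal.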

For a nice education on various fun with punctuation in C/C++ programs (including a trigraph bug that would definitely have me pulling my hair out), take a look at Herb Sutter's GOTW #86 article.


Addendum:

It looks like GCC will not process (and will warn about) trigraphs by default. Some other compilers have options to turn off trigraph support (IBM's, for example). Microsoft started supporting a warning (C4837) in VS2008 that must be explicitly enabled (with /Wall, for example).

Use of '??=', '??<' and '??>' in C

What is the significance of '??=', '??<' and '??>' here ?

??= will be replaced with #,

??< will be replaced with {,

??> will be replaced with },

by the preprocessor. These are called trigraphs. There are 9 trigraphs in total; the others are:

??( will be replaced with [,

??) will be replaced with ],

??/ will be replaced with \,

??' will be replaced with ^,

??! will be replaced with |,

??- will be replaced with ~.

Trigraphs are processed very early in the translation process, before the source code is tokenized. They can affect comments and strings and character literals.
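To show what that early replacement amounts to, here is a rough sketch of the substitution as a plain string function. It is not how any particular compiler implements phase 1; replace_trigraphs and the two table names are invented for this example, and the tables are split into two parallel strings so the file itself contains no trigraph.

```c
#include <stddef.h>
#include <string.h>

/* Third character of each trigraph and the character it stands for. */
static const char third_chars[]  = "=()/'<>!-";
static const char replacements[] = "#[]\\^{}|~";

/* Copies src to dst, replacing each trigraph with its single character.
   dst must have room for strlen(src) + 1 bytes; returns the output length. */
size_t replace_trigraphs(const char *src, char *dst) {
    size_t n = 0;
    while (*src != '\0') {
        const char *hit = NULL;
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(third_chars, src[2]);
        if (hit != NULL) {
            dst[n++] = replacements[hit - third_chars];
            src += 3;                 /* consume the whole trigraph */
        } else {
            dst[n++] = *src++;        /* ordinary character, copy as-is */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Because the function knows nothing about comments or quotes, it rewrites trigraphs everywhere, which is the same reason the real replacement reaches into string and character literals.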

Why are string literals parsed for trigraph sequences in Gnu gcc/g++?

Trigraphs were handled in translation phase 1 (they are removed in C++17, however). String literal related processing happens in subsequent phases. As the C++14 standard specifies (n4140) [lex.phases]/1.1:

The precedence among the syntax rules of translation is specified by
the following phases.

  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set
    (introducing new-line characters for end-of-line indicators) if
    necessary. The set of physical source file characters accepted is
    implementation-defined. Trigraph sequences ([lex.trigraph]) are
    replaced by corresponding single-character internal representations.

    Any source file character not in the basic source character set
    ([lex.charset]) is replaced by the universal-character-name that
    designates that character. (An implementation may use any internal
    encoding, so long as an actual extended character encountered in the
    source file, and the same extended character expressed in the source
    file as a universal-character-name (i.e., using the \uXXXX notation),
    are handled equivalently except where this replacement is reverted in
    a raw string literal.)

This happened first, because as you were told in comments, the characters that trigraphs stood for needed to be printable as well.

Trigraph characters

Trigraphs are disabled by default in gcc. If you are using gcc then compile with -trigraphs to enable trigraphs:

gcc -trigraphs source.c
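If you want to check from inside the program whether the current build replaces trigraphs, one possible trick is to look at the size of a literal that contains one. This is only a sketch, and trigraphs_enabled is a name I made up.

```c
/* The literal below is spelled with two question marks and an exclamation mark.
   With trigraph replacement it holds a single '|' (2 bytes with the terminator);
   without it, it holds all three characters (4 bytes). */
int trigraphs_enabled(void) {
    return sizeof("??!") == 2;
}
```

I would expect this to return 1 when built with gcc -trigraphs and 0 in GCC's default mode, where GCC may also print a warning that the trigraph was ignored.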

Meaning of character literals containing trigraphs for non-representable characters

When it comes to considerations about the environment, especially to files, the C standard intentionally becomes rather vague. The following guarantees are made about trigraphs and the encoding of their corresponding characters:

C11 (n1570) 5.1.1.2 p1 (“Translation phases”) [emph. mine]

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Thus, the trigraph sequence must be mapped to a single byte. This single-byte character must be a member of the basic character set and distinct from every other member of it. How the compiler handles trigraphs internally during translation isn’t really observable behaviour, so it’s irrelevant.

If written to a text stream it may be converted (as I read it, maybe back to a trigraph sequence if the underlying encoding doesn’t have an encoding for a certain character). It can be read back again, and must compare equal if it is considered a printing character. Ibid. 7.21.2 p2:

[…] Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. […]

Ibid. 7.4 p3:

The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters.*) All letters and digits are printing characters.

*) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

And for binary streams, ibid. 7.21.2 p3:

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation- defined number of null characters appended to the end of the stream.

In the comments above, the question arose if

printf("int main(void) ??< ??>\n");     // (1) 
printf("int main(void) ?\?< ?\?>\n"); // (2)

always works for code generation and the output of that statement is guaranteed to be compilable. I couldn’t find a normative reference requiring isprint('??<') etc. (for (1)) or even isprint('<') etc (for (2)) to return non-zero, but the C89 rationale about streams says:

The set of characters required to be preserved in text stream I/O are those needed for writing C programs; the intent is the Standard should permit a C translator to be written in a maximally portable fashion. Control characters such as backspace are not required for this purpose, so their handling in text streams is not mandated.

When '??<' etc. is written to a binary stream, it must map to a single byte, be printed as such, be unique and distinguishable from any other basic character, and compare equal to '??<' when read back.
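Since I couldn't find a normative guarantee, the isprint question can at least be checked empirically on a given implementation. The sketch below counts how many of the nine characters that trigraphs stand for are reported as printing characters; the names are hypothetical, and the result depends on the locale and execution character set.

```c
#include <ctype.h>

/* The nine characters that the trigraphs stand for. */
static const char trigraph_targets[] = "#[]\\^{}|~";

/* Counts how many of them isprint() reports as printing characters
   in the current locale. */
int count_printable_targets(void) {
    int count = 0;
    for (const char *p = trigraph_targets; *p != '\0'; ++p)
        if (isprint((unsigned char)*p))
            ++count;
    return count;
}
```

On an ASCII-based implementation in the default "C" locale I would expect all nine to count, but that is an observation about common systems rather than something the text quoted above promises.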


Related: C89 rationale about trigraphs.

Are trigraphs still valid C++?

Trigraphs were valid through C++14, but they were removed in C++17.

Trigraphs were proposed for deprecation in C++0x, which was released
as C++11. This was opposed by IBM, speaking on behalf of itself and
other users of C++, and as a result trigraphs were retained in
C++0x. Trigraphs were then proposed again for removal (not only
deprecation) in C++17. This passed a committee vote, and trigraphs
are expected to be removed from C++17
despite the opposition from IBM
and others. Existing code that uses trigraphs can be supported by
translating from the physical source files (parsing trigraphs) to the
basic source character set that does not include trigraphs. [Wikipedia]

Digraphs, however, are sticking around for now.

Curious trigraph sequence thing about ansi C

Short answer: keyboards/character encodings that didn't include such characters.

From wikipedia:

The basic character set of the C programming language is a superset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the keyboard being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set.

http://en.wikipedia.org/wiki/Digraphs_and_trigraphs

how can i skip those warnings? C++

A trigraph sequence is any sequence of characters that starts with "??"; the next character determines the meaning of the sequence. Trigraph sequences are (or were) used to represent characters that weren't provided on some keyboards. So, for example, "??=" means #.

Trigraph sequences aren't widely used any more; I haven't checked, but they may well have been deprecated in C++ or removed entirely. (Thanks to @johnathan for pointing out that they were removed in C++17)

In any event, if you can't turn off that warning, you can change the character sequence so that it looks the same to the compiler but isn't a trigraph. To do that, change one of the ? characters to \?. So "??=" would become "?\?="; that's not a trigraph, because it doesn't consist of the characters "??" followed by another character, but once the compiler has processed it, it's two question marks followed by an '=' sign.

Another way to rearrange the quoted strings is to separate them. So "??=" would become "??" "=" or "?" "?="; the compiler concatenates those adjacent string literals, but, again, they're not trigraph sequences because the concatenation occurs after checking for trigraphs.
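If the strings come from a generator rather than from your own typing, the \? rewrite can be automated. Here is a minimal sketch of such a helper; escape_question_marks is a hypothetical name, and it only deals with question marks (it does not escape quotes or backslashes).

```c
#include <stddef.h>

/* Writes src to dst as text for a C string literal, inserting a backslash
   before any '?' that directly follows another '?'.
   dst needs room for 2 * strlen(src) + 1 bytes; returns the output length. */
size_t escape_question_marks(const char *src, char *dst) {
    size_t n = 0;
    char prev = '\0';
    for (; *src != '\0'; ++src) {
        if (*src == '?' && prev == '?')
            dst[n++] = '\\';          /* break up the "??" pair */
        dst[n++] = *src;
        prev = *src;
    }
    dst[n] = '\0';
    return n;
}
```

The output never contains two adjacent question marks, so it cannot start a trigraph, and once compiled each \? turns back into a plain question mark.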

Are trigraphs required to write a newline character in C99 using only ISO 646?

Your premise:

Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.

is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)

Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).

Update:

You ask specifically whether it is possible to

include the '\n' character (which is translated to a newline in functions) in a string without the use of trigraphs, or

No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").

You also ask whether it is possible to

write a newline to a FILE * without using the '\n' character?

This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could

fputc(0xa, my_file);

but that does not necessarily produce a result equivalent to

fputc('\n', my_file);

Why are there digraphs in C and C++?

Digraphs were created for programmers whose keyboards and character sets (national variants of ISO 646) lacked some of the punctuation characters that C uses.

http://en.wikipedia.org/wiki/C_trigraph


