Why Was the Space Character Not Chosen for C++14 Digit Separators

Why was the space character not chosen for C++14 digit separators?

An earlier paper, N3499, explains that although Bjarne Stroustrup himself suggested spaces as separators, the idea ran into compatibility problems:

While this approach is consistent with one common typeographic style, it suffers from some compatibility problems.

  • It does not match the syntax for a pp-number, and would minimally require extending that syntax.
  • More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. The preprocessor would not know whether to perform symbol substitution starting after the space.
  • It would likely make editing tools that grab "words" less reliable.

I guess the following example is the main problem noted:

const int x = 0x123 a;

though in my opinion this rationale is fairly weak. I still can't think of real-world code that a space separator would break.

The "editing tools" rationale is even worse, since 1'234 breaks basically every syntax highlighter known to mankind (e.g. that used by Markdown in the above question itself!) and makes updated versions of said highlighters much harder to implement.

Still, for better or worse, this is the rationale that led to the adoption of apostrophes instead.
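To make the macro concern from the quoted list concrete, here is a minimal sketch. The macro named a is hypothetical and deliberately contrived (nothing like it appears in the paper). It only shows that under today's rules 0x123 a is two tokens, so the macro is expanded, whereas a space-separator rule would have had to tell that apart from the single literal 0x123a.

```cpp
// Hypothetical macro whose name is also a hexadecimal digit.
#define a +1

// Today: "0x123" and "a" are separate tokens, so the macro expands
// and the initializer becomes 0x123 + 1.
constexpr int with_space = 0x123 a;

// C++14 separator: one pp-number token, the literal 0x123a.
constexpr int with_quote = 0x123'a;

#undef a

static_assert(with_space == 0x124, "macro was expanded after the space");
static_assert(with_quote == 0x123a, "apostrophe keeps a single literal");
```

With the apostrophe the preprocessor never has to guess: the whole of 0x123'a is one token, so no substitution is attempted inside it.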

Are C++14 digit separators allowed in user defined literals?

If you look at the grammar, user-defined-integer-literal can be octal-literal ud-suffix, and octal-literal is defined as either 0 or octal-literal ’opt octal-digit.

N4140 §2.14.8

user-defined-literal:

  • user-defined-integer-literal
  • [...]

user-defined-integer-literal:

  • octal-literal ud-suffix
  • [...]

N4140 §2.14.2

octal-literal:

  • 0
  • octal-literal ’opt octal-digit

So 01'23s is a perfectly valid literal.
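A short sketch of what that means in practice, assuming the s suffix is the standard std::chrono seconds literal from C++14. The _units suffix is a hypothetical user-defined one added only for illustration.

```cpp
#include <chrono>

// Standard library suffix: std::chrono::seconds (C++14).
using namespace std::chrono_literals;
constexpr std::chrono::seconds octal_seconds = 01'23s;  // octal 0123 == 83

// Hypothetical user-defined suffix, for illustration only.
constexpr unsigned long long operator""_units(unsigned long long v) {
    return v;
}
constexpr auto big = 1'000'000_units;

static_assert(octal_seconds.count() == 83, "separator does not change the value");
static_assert(big == 1000000ULL, "separators are dropped before the suffix applies");
```

Note that the leading 0 makes 01'23s an octal literal, so it denotes 83 seconds rather than 123.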

Why does the C parser not allow spaces between the digits of an integer literal?

The language doesn't allow this: an integer literal is a single token, and intervening whitespace splits it into two tokens. There is typically little to no cost, however, in writing the initializer as a calculation on literals instead:

int i = 10 * 1000; /* ten thousand */

Are digit separators allowed before the digits in a hex or binary number?

If we look at the grammar in the draft C++14 standard, N4140 section 2.14.2 [lex.icon], a separator is not allowed directly after the base indicator of a hexadecimal or binary literal:

binary-literal:
0b binary-digit
0B binary-digit
binary-literal ’opt binary-digit
[...]
hexadecimal-literal:
0x hexadecimal-digit
0X hexadecimal-digit
hexadecimal-literal ’opt hexadecimal-digit

Octal literals, however, do allow the separator directly after the base indicator (the leading 0):

octal-literal:
0
octal-literal ’opt octal-digit

We can also check this using one of the online compilers that support C++14, such as Coliru or Wandbox.
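A minimal sketch of such a check, following the grammar quoted above. The variable names are made up for illustration, and the ill-formed cases are left commented out so that the snippet compiles.

```cpp
// Separators between digits are accepted in every base.
constexpr int hex_value = 0x1'F;      // 31
constexpr int bin_value = 0b1'0000;   // 16

// Octal: the leading 0 is itself an octal-literal, so a separator
// may directly follow it.
constexpr int oct_value = 0'17;       // octal 17 == 15

// Ill-formed according to the grammar (kept as comments):
// constexpr int bad_hex = 0x'1F;
// constexpr int bad_bin = 0b'10000;

static_assert(hex_value == 31, "0x1'F is 0x1F");
static_assert(bin_value == 16, "0b1'0000 is 0b10000");
static_assert(oct_value == 15, "0'17 is 017");
```

Uncommenting either of the last two declarations should produce a compile error on a conforming C++14 compiler.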

The Evolution Working Group issue that tracked this change was issue 27: N3781 Single-Quotation-Mark as a Digit Separator, N3661, N3499 Digit Separators, N3448 Painless Digit Separation. I don't see an obvious rationale for this design decision; perhaps it is solely a literal interpretation of "digit separator", meaning the separator has to sit between two digits.

Note that a list of the draft standards can be found in the question "Where do I find the current C or C++ standard documents?".

Digit Places in Processing

If I understand you correctly, I have been in the same situation. The way I found to get calculator-style input is to collect the keystrokes in a string and convert it to a float or int later. I was dealing with timecode, so I had no floats or dots; you will need to add that, but the idea is:

[edit] Added a simple dot input handler; it seems to work :)

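// Accumulates the typed digits (and at most one dot) as text.
StringBuilder buff = new StringBuilder();

// An empty draw() keeps the sketch running so that key events fire.
void draw() {}

void keyReleased() {
  if (key != CODED) {
    char c = key;
    boolean isDigit = c >= '0' && c <= '9';
    // Accept a dot only if the buffer does not contain one yet.
    boolean isFirstDot = c == '.' && buff.indexOf(".") < 0;
    if (isDigit || isFirstDot) {
      buff.append(c);
    }
    println("buff = " + buff);
  }
  // Convert the text to a number only at the point where it is needed.
  print("in float... plus 100.25 equals: ");
  println(100.25 + float(buff.toString()));
}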

Meaning of character literals containing trigraphs for non-representable characters

When it comes to considerations about the environment, especially to files, the C standard intentionally becomes rather vague. The following guarantees are made about trigraphs and the encoding of their corresponding characters:

C11 (n1570) 5.1.1.2 p1 (“Translation phases”) [emph. mine]

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Thus, each trigraph sequence must be mapped to a single byte. This single-byte character must be a member of the basic character set and distinct from every other member of it. How the compiler handles these characters internally during translation isn’t really observable behaviour, so it’s irrelevant.

If written to a text stream it may be converted (as I read it, maybe back to a trigraph sequence if the underlying encoding doesn’t have an encoding for a certain character). It can be read back again, and must compare equal if it is considered a printing character. Ibid. 7.21.2 p2:

[…] Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. […]

Ibid. 7.4 p3:

The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters.*) All letters and digits are printing characters.

*) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

And for binary streams, ibid. 7.21.2 p3:

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream.

In the comments above, the question arose whether

printf("int main(void) ??< ??>\n");     // (1) 
printf("int main(void) ?\?< ?\?>\n"); // (2)

always work for code generation, that is, whether the output of those statements is guaranteed to be compilable. I couldn’t find a normative reference requiring isprint('??<') etc. (for (1)) or even isprint('<') etc. (for (2)) to return non-zero, but the C89 rationale about streams says:

The set of characters required to be preserved in text stream I/O are those needed for writing C programs; the intent is the Standard should permit a C translator to be written in a maximally portable fashion. Control characters such as backspace are not required for this purpose, so their handling in text streams is not mandated.

When '??<' etc. is written to a binary stream, it must map to a single byte, be printed as such, be unique and distinguishable from any other basic character, and compare equal to '??<' when read back.
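As a small compile-time check of the escaped spelling used in statement (2), here is a sketch in C++14 (the \? escape behaves the same way in C). The array name is made up for illustration. Only the escaped form is asserted, because whether the unescaped form in (1) is replaced depends on the compiler mode: trigraphs were removed in C++17 and some compilers disable them by default.

```cpp
// The escape \? prevents a trigraph from forming, so this literal always
// holds the three characters '?', '?', '<' plus the terminating null.
constexpr char escaped[] = "?\?<";

static_assert(sizeof(escaped) == 4, "three characters plus the null terminator");
static_assert(escaped[0] == '?' && escaped[1] == '?' && escaped[2] == '<',
              "no trigraph replacement took place");
```

When such a string is printed, the output contains the literal text of a trigraph, which a later compilation with trigraphs enabled would then translate.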


Related: C89 rationale about trigraphs.


