😃 (And Other Unicode Characters) in Identifiers Not Allowed by G++

😃 (and other Unicode characters) in identifiers not allowed by g++

As of 4.8, gcc does not support characters outside of the BMP in identifiers. This seems to be an unnecessary restriction. Also, gcc only supports a very restricted set of characters described in ucnid.tab, based on C99 and C++98 (it seems it has not been updated to C11 and C++11 yet).

As described in the manual, -fextended-identifiers is experimental, so there is a higher chance that it won't work as expected.


Edit:

GCC has supported the C11 character set since 4.9.0 (svn r204886 to be precise). So OP's second piece of code using \U0001F603 does work. I still can't get the actual code using 😃 to work even with -finput-charset=UTF-8 with GCC 8.2 on https://gcc.godbolt.org though (you may want to follow this bug report, provided by @DanielWolf).

Meanwhile both pieces of code work on clang 3.3 without any options other than -std=c++11.

Unicode/special characters in variable names in clang not allowed?

So the clang document says (emphasis mine):

This feature allows identifiers to contain certain Unicode characters,
as specified by the active language standard;

This is covered in the draft C++ standard Annex E, the characters allowed are as follows:

E.1 Ranges of characters allowed [charname.allowed]

00A8, 00AA, 00AD,

00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

0100-167F, 1681-180D, 180F-1FFF 200B-200D, 202A-202E, 203F-2040, 2054,

2060-206F 2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

3004-3007, 3021-302F, 3031-303F

3040-D7FF F900-FD3D, FD40-FDCF,

FDF0-FE44, FE47-FFFD

10000-1FFFD, 20000-2FFFD, 30000-3FFFD,
40000-4FFFD, 50000-5FFFD, 60000-6FFFD, 70000-7FFFD, 80000-8FFFD,
90000-9FFFD, A0000-AFFFD, B0000-BFFFD, C0000-CFFFD, D0000-DFFFD,
E0000-EFFFD

The code for infinity 221E is not included in the list.
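To check a code point against these ranges mechanically, here is a minimal C++14 sketch. The names Range, kAllowedBmp and in_annex_e1 are my own (hypothetical), and the table is simply transcribed from the list above. It only covers the extended characters, not the basic a-z, digits or underscore.

```cpp
// Inclusive range of code points.
struct Range {
    char32_t first;
    char32_t last;
};

// BMP ranges transcribed from the Annex E.1 list quoted above.
constexpr Range kAllowedBmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF}, {0x0100, 0x167F}, {0x1681, 0x180D},
    {0x180F, 0x1FFF}, {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040},
    {0x2054, 0x2054}, {0x2060, 0x206F}, {0x2070, 0x218F}, {0x2460, 0x24FF},
    {0x2776, 0x2793}, {0x2C00, 0x2DFF}, {0x2E80, 0x2FFF}, {0x3004, 0x3007},
    {0x3021, 0x302F}, {0x3031, 0x303F}, {0x3040, 0xD7FF}, {0xF900, 0xFD3D},
    {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// True if code point c falls into one of the E.1 ranges.
constexpr bool in_annex_e1(char32_t c) {
    for (const Range& r : kAllowedBmp) {
        if (c >= r.first && c <= r.last) return true;
    }
    // Planes 1 to 14: everything except the last two positions of each plane.
    return c >= 0x10000 && c <= 0xEFFFD && (c & 0xFFFF) <= 0xFFFD;
}
```

With this, in_annex_e1(0x221E) (infinity) is false, while in_annex_e1(0x00F8) (ø) and in_annex_e1(0x1F603) (😃) are true.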

For reference: these are the codes above converted to unicode characters (some of them may not display correctly in all browsers/available fonts).

¨, ª, ­,

¯, ²-µ, ·-º, ¼-¾, À-Ö, Ø-ö, ø-ÿ

Ā-ᙿ, ᚁ-᠍, ᠏-῿ ​-‍, ‪-‮, ‿-⁀, ⁔,

⁠- ⁰-↏, ①-⓿, ❶-➓, Ⰰ-ⷿ, ⺀-⿿

〄-〇, 〡-〯, 〱-〿

぀-퟿ 豈-ﴽ, ﵀-﷏,

ﷰ-﹄, ﹇-�


I could not find an extensive document that covers the rationale for the ranges chosen, although N3146: Recommendations for extended identifier characters for C and C++ does provide some details on the influences.

g++ unicode variable name

You have to specify the -fextended-identifiers flag when compiling, and you also have to use \uXXXX or \UXXXXXXXX for Unicode characters (at least in gcc it's Unicode).

Identifiers (variable/class names etc) in g++ can't be of utf-8/utf-16 or whatever encoding,
they have to be:

identifier:
    nondigit
    identifier nondigit
    identifier digit

a nondigit is

nondigit: one of
    universal-character-name
    _ a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

and a universal-character-name is

universal-character-name:
    \UXXXXXXXX
    \uXXXX

Thus, if you save your source file as UTF-8, you cannot have a variable like e.g.:

int høyde = 10;

it has to be written like:

int h\u00F8yde = 10;

(which imo would defeat the whole purpose - so just stick with a-z)
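To make the grammar above concrete, here is a small C++14 sketch that checks whether an ASCII string matches it. The function names are hypothetical, and it only checks the syntax: it does not verify that a \u or \U escape designates a character the standard actually allows in identifiers.

```cpp
// ASCII character classes used by the grammar.
constexpr bool is_hex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_basic_nondigit(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Length of a universal-character-name at the start of s:
// 6 for \uXXXX, 10 for \UXXXXXXXX, 0 if there is none.
constexpr int ucn_length(const char* s) {
    if (s[0] != '\\') return 0;
    int digits = s[1] == 'u' ? 4 : (s[1] == 'U' ? 8 : 0);
    if (digits == 0) return 0;
    for (int i = 0; i < digits; ++i) {
        if (!is_hex(s[2 + i])) return 0;  // also stops at the terminating NUL
    }
    return 2 + digits;
}

// True if the NUL-terminated string s matches the identifier grammar.
constexpr bool matches_identifier_grammar(const char* s) {
    bool first = true;
    while (*s != '\0') {
        int n = ucn_length(s);
        if (n > 0) {
            s += n;                       // universal-character-name
        } else if (is_basic_nondigit(*s)) {
            ++s;                          // _ or a letter
        } else if (!first && is_digit(*s)) {
            ++s;                          // digits may not come first
        } else {
            return false;
        }
        first = false;
    }
    return !first;                        // the empty string is not an identifier
}
```

For example, matches_identifier_grammar("h\\u00F8yde") is true, while "2fast" and the truncated "h\\u00F" are rejected.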

How to know if a Unicode character is 'allowed' in 'Security Profile for General Identifiers' for PHP?

Looks like you have a very structured text document there. Seems perfect for a regex!

^([0-9A-F]+)(\.\.[0-9A-F]+)?\s*;\sAllowed

I ran that regex against the file (with g and m modifiers), and I got a bunch of matches. But I doubt anyone wants to match these by hand. More regex!

So I ran this regex:

^([0-9A-F]+)

and this replacement:

\\x{$1}

This replaced about half of the instances, so it looks like:

\x{0030}
..0039
\x{0041}
..005A
\x{005F}
\x{0061}
..007A

But we need more regex...

\.\.([0-9A-F]+)

and this replacement:

-\\x{$1}

Now it looks like this:

\x{0030}
-\x{0039}
\x{0041}
-\x{005A}
\x{005F}
\x{0061}
-\x{007A}

Almost... After removing all \s with another regex, I can add a [ and a ] and I get...

The Solution

It's a char class, so use it like you would any other. Warning: very long...

[\x{0030}-\x{0039}\x{0041}-\x{005A}\x{005F}\x{0061}-\x{007A}\x{00C0}-\x{00D6}\x{00D8}-\x{00F6}\x{00F8}-\x{0131}\x{0134}-\x{013E}\x{0141}-\x{0148}\x{014A}-\x{017E}\x{018F}\x{01A0}-\x{01A1}\x{01AF}-\x{01B0}\x{01CD}-\x{01DC}\x{01DE}-\x{01E3}\x{01E6}-\x{01F0}\x{01F4}-\x{01F5}\x{01F8}-\x{021B}\x{021E}-\x{021F}\x{0226}-\x{0233}\x{0259}\x{02BB}-\x{02BC}\x{02EC}\x{0300}-\x{0304}\x{0306}-\x{030C}\x{030F}-\x{0311}\x{0313}-\x{0314}\x{031B}\x{0323}-\x{0328}\x{032D}-\x{032E}\x{0330}-\x{0331}\x{0335}\x{0338}-\x{0339}\x{0342}\x{0345}\x{037B}-\x{037D}\x{0386}\x{0388}-\x{038A}\x{038C}\x{038E}-\x{03A1}\x{03A3}-\x{03CE}\x{03FC}-\x{045F}\x{048A}-\x{0529}\x{052E}-\x{052F}\x{0531}-\x{0556}\x{0559}\x{0561}-\x{0586}\x{05B4}\x{05D0}-\x{05EA}\x{05F0}-\x{05F2}\x{0620}-\x{063F}\x{0641}-\x{0655}\x{0660}-\x{0669}\x{0670}-\x{0672}\x{0674}\x{0679}-\x{068D}\x{068F}-\x{06D3}\x{06D5}\x{06E5}-\x{06E6}\x{06EE}-\x{06FC}\x{06FF}\x{0750}-\x{07B1}\x{08A0}-\x{08AC}\x{08B2}\x{0901}-\x{094D}\x{094F}-\x{0950}\x{0956}-\x{0957}\x{0960}-\x{0963}\x{0966}-\x{096F}\x{0971}-\x{0977}\x{0979}-\x{097F}\x{0981}-\x{0983}\x{0985}-\x{098C}\x{098F}-\x{0990}\x{0993}-\x{09A8}\x{09AA}-\x{09B0}\x{09B2}\x{09B6}-\x{09B9}\x{09BC}-\x{09C4}\x{09C7}-\x{09C8}\x{09CB}-\x{09CE}\x{09D7}\x{09E0}-\x{09E3}\x{09E6}-\x{09F1}\x{0A01}-\x{0A03}\x{0A05}-\x{0A0A}\x{0A0F}-\x{0A10}\x{0A13}-\x{0A28}\x{0A2A}-\x{0A30}\x{0A32}\x{0A35}\x{0A38}-\x{0A39}\x{0A3C}\x{0A3E}-\x{0A42}\x{0A47}-\x{0A48}\x{0A4B}-\x{0A4D}\x{0A5C}\x{0A66}-\x{0A74}\x{0A81}-\x{0A83}\x{0A85}-\x{0A8D}\x{0A8F}-\x{0A91}\x{0A93}-\x{0AA8}\x{0AAA}-\x{0AB0}\x{0AB2}-\x{0AB3}\x{0AB5}-\x{0AB9}\x{0ABC}-\x{0AC5}\x{0AC7}-\x{0AC9}\x{0ACB}-\x{0ACD}\x{0AD0}\x{0AE0}-\x{0AE3}\x{0AE6}-\x{0AEF}\x{0B01}-\x{0B03}\x{0B05}-\x{0B0C}\x{0B0F}-\x{0B10}\x{0B13}-\x{0B28}\x{0B2A}-\x{0B30}\x{0B32}-\x{0B33}\x{0B35}-\x{0B39}\x{0B3C}-\x{0B43}\x{0B47}-\x{0B48}\x{0B4B}-\x{0B4D}\x{0B56}-\x{0B57}\x{0B5F}-\x{0B61}\x{0B66}-\x{0B6F}\x{0B71}\x{0B82}-\x{0B83}\x{0B85}-\x{0B8A}\x{0B8E}-\x{0B90}\x{0B92}-\x{0B95}\x{0B99}-\x{0B9A}\x{0B
9C}\x{0B9E}-\x{0B9F}\x{0BA3}-\x{0BA4}\x{0BA8}-\x{0BAA}\x{0BAE}-\x{0BB9}\x{0BBE}-\x{0BC2}\x{0BC6}-\x{0BC8}\x{0BCA}-\x{0BCD}\x{0BD0}\x{0BD7}\x{0BE6}-\x{0BEF}\x{0C01}-\x{0C03}\x{0C05}-\x{0C0C}\x{0C0E}-\x{0C10}\x{0C12}-\x{0C28}\x{0C2A}-\x{0C33}\x{0C35}-\x{0C39}\x{0C3D}-\x{0C44}\x{0C46}-\x{0C48}\x{0C4A}-\x{0C4D}\x{0C55}-\x{0C56}\x{0C60}-\x{0C61}\x{0C66}-\x{0C6F}\x{0C82}-\x{0C83}\x{0C85}-\x{0C8C}\x{0C8E}-\x{0C90}\x{0C92}-\x{0CA8}\x{0CAA}-\x{0CB3}\x{0CB5}-\x{0CB9}\x{0CBC}-\x{0CC4}\x{0CC6}-\x{0CC8}\x{0CCA}-\x{0CCD}\x{0CD5}-\x{0CD6}\x{0CE0}-\x{0CE3}\x{0CE6}-\x{0CEF}\x{0CF1}-\x{0CF2}\x{0D02}-\x{0D03}\x{0D05}-\x{0D0C}\x{0D0E}-\x{0D10}\x{0D12}-\x{0D3A}\x{0D3D}-\x{0D43}\x{0D46}-\x{0D48}\x{0D4A}-\x{0D4E}\x{0D57}\x{0D60}-\x{0D61}\x{0D66}-\x{0D6F}\x{0D7A}-\x{0D7F}\x{0D82}-\x{0D83}\x{0D85}-\x{0D8E}\x{0D91}-\x{0D96}\x{0D9A}-\x{0DA5}\x{0DA7}-\x{0DB1}\x{0DB3}-\x{0DBB}\x{0DBD}\x{0DC0}-\x{0DC6}\x{0DCA}\x{0DCF}-\x{0DD4}\x{0DD6}\x{0DD8}-\x{0DDE}\x{0DF2}\x{0E01}-\x{0E32}\x{0E34}-\x{0E3A}\x{0E40}-\x{0E4E}\x{0E50}-\x{0E59}\x{0E81}-\x{0E82}\x{0E84}\x{0E87}-\x{0E88}\x{0E8A}\x{0E8D}\x{0E94}-\x{0E97}\x{0E99}-\x{0E9F}\x{0EA1}-\x{0EA3}\x{0EA5}\x{0EA7}\x{0EAA}-\x{0EAB}\x{0EAD}-\x{0EB2}\x{0EB4}-\x{0EB9}\x{0EBB}-\x{0EBD}\x{0EC0}-\x{0EC4}\x{0EC6}\x{0EC8}-\x{0ECD}\x{0ED0}-\x{0ED9}\x{0EDE}-\x{0EDF}\x{0F00}\x{0F20}-\x{0F29}\x{0F35}\x{0F37}\x{0F3E}-\x{0F42}\x{0F44}-\x{0F47}\x{0F49}-\x{0F4C}\x{0F4E}-\x{0F51}\x{0F53}-\x{0F56}\x{0F58}-\x{0F5B}\x{0F5D}-\x{0F68}\x{0F6A}-\x{0F6C}\x{0F71}-\x{0F72}\x{0F74}\x{0F7A}-\x{0F80}\x{0F82}-\x{0F84}\x{0F86}-\x{0F92}\x{0F94}-\x{0F97}\x{0F99}-\x{0F9C}\x{0F9E}-\x{0FA1}\x{0FA3}-\x{0FA6}\x{0FA8}-\x{0FAB}\x{0FAD}-\x{0FB8}\x{0FBA}-\x{0FBC}\x{0FC6}\x{1000}-\x{1049}\x{1050}-\x{109D}\x{10C7}\x{10CD}\x{10D0}-\x{10F0}\x{10F7}-\x{10FA}\x{10FD}-\x{10FF}\x{1200}-\x{1248}\x{124A}-\x{124D}\x{1250}-\x{1256}\x{1258}\x{125A}-\x{125D}\x{1260}-\x{1288}\x{128A}-\x{128D}\x{1290}-\x{12B0}\x{12B2}-\x{12B5}\x{12B8}-\x{12BE}\x{12C0}\x{12C2}-\x{12C5}\x{12C8}-\x{12D6}\x{12D8}-\x{1310}\x{1312}-\x{1315}\x
{1318}-\x{135A}\x{135D}-\x{135F}\x{1380}-\x{138F}\x{1780}-\x{17A2}\x{17A5}-\x{17A7}\x{17A9}-\x{17B3}\x{17B6}-\x{17CA}\x{17D2}\x{17D7}\x{17DC}\x{17E0}-\x{17E9}\x{1E00}-\x{1E99}\x{1E9E}\x{1EA0}-\x{1EF9}\x{1F00}-\x{1F15}\x{1F18}-\x{1F1D}\x{1F20}-\x{1F45}\x{1F48}-\x{1F4D}\x{1F50}-\x{1F57}\x{1F59}\x{1F5B}\x{1F5D}\x{1F5F}-\x{1F70}\x{1F72}\x{1F74}\x{1F76}\x{1F78}\x{1F7A}\x{1F7C}\x{1F80}-\x{1FB4}\x{1FB6}-\x{1FBA}\x{1FBC}\x{1FC2}-\x{1FC4}\x{1FC6}-\x{1FC8}\x{1FCA}\x{1FCC}\x{1FD0}-\x{1FD2}\x{1FD6}-\x{1FDA}\x{1FE0}-\x{1FE2}\x{1FE4}-\x{1FEA}\x{1FEC}\x{1FF2}-\x{1FF4}\x{1FF6}-\x{1FF8}\x{1FFA}\x{1FFC}\x{2D27}\x{2D2D}\x{2D80}-\x{2D96}\x{2DA0}-\x{2DA6}\x{2DA8}-\x{2DAE}\x{2DB0}-\x{2DB6}\x{2DB8}-\x{2DBE}\x{2DC0}-\x{2DC6}\x{2DC8}-\x{2DCE}\x{2DD0}-\x{2DD6}\x{2DD8}-\x{2DDE}\x{3005}-\x{3007}\x{3041}-\x{3096}\x{3099}-\x{309A}\x{309D}-\x{309E}\x{30A1}-\x{30FA}\x{30FC}-\x{30FE}\x{3105}-\x{312D}\x{31A0}-\x{31BA}\x{3400}-\x{4DB5}\x{4E00}-\x{9FD5}\x{A660}-\x{A661}\x{A674}-\x{A67B}\x{A67F}\x{A69F}\x{A717}-\x{A71F}\x{A788}\x{A78D}-\x{A78E}\x{A790}-\x{A793}\x{A7A0}-\x{A7AA}\x{A7FA}\x{A9E7}-\x{A9FE}\x{AA60}-\x{AA76}\x{AA7A}-\x{AA7F}\x{AB01}-\x{AB06}\x{AB09}-\x{AB0E}\x{AB11}-\x{AB16}\x{AB20}-\x{AB26}\x{AB28}-\x{AB2E}\x{AC00}-\x{D7A3}\x{FA0E}-\x{FA0F}\x{FA11}\x{FA13}-\x{FA14}\x{FA1F}\x{FA21}\x{FA23}-\x{FA24}\x{FA27}-\x{FA29}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2B820}-\x{2CEA1}\x{0027}\x{002D}-\x{002E}\x{003A}\x{00B7}\x{0375}\x{058A}\x{05F3}-\x{05F4}\x{06FD}-\x{06FE}\x{0F0B}\x{200C}-\x{200D}\x{2010}\x{2019}\x{2027}\x{30A0}\x{30FB}]

You can use it like this (where "regex" is the code above):

$re = "/regex/u";

$str = "t";
echo preg_match($re, $str, $matches);//1

echo "<br>";

$str = "(̶";
echo preg_match($re, $str, $matches);//0
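If you would rather build the ranges in code than by search-and-replace, the first regex above can be applied line by line. Here is a rough C++17 sketch of that idea. CodePointRange and parse_allowed_line are hypothetical names, and the line format is only assumed from what that regex matches (I relaxed the whitespace around the semicolon slightly).

```cpp
#include <optional>
#include <regex>
#include <string>

// Inclusive range of code points taken from one line of the file.
struct CodePointRange {
    unsigned long first;
    unsigned long last;
};

// Parses lines such as "0030..0039 ; Allowed" or "005F ; Allowed".
// Returns std::nullopt for lines that are not marked Allowed.
std::optional<CodePointRange> parse_allowed_line(const std::string& line) {
    static const std::regex pattern(
        R"(^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*Allowed)");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) return std::nullopt;
    unsigned long first = std::stoul(m[1].str(), nullptr, 16);
    // A single code point is a range whose ends coincide.
    unsigned long last =
        m[2].matched ? std::stoul(m[2].str(), nullptr, 16) : first;
    return CodePointRange{first, last};
}
```

Each returned range corresponds to one \x{...}-\x{...} entry (or a single \x{...}) in the character class.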

Unicode Identifiers and Source Code in C++11?

Is the new standard more open w.r.t to Unicode?

With respect to allowing universal character names in identifiers the answer is no; UCNs were allowed in identifiers back in C99 and C++98. However compilers did not implement that particular requirement until recently. Clang 3.3 I think introduces support for this and GCC has had an experimental feature for this for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would also be coming to VC++ at some point. (Although IIRC Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)

It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:

long pörk;

not:

long p\u00F6rk;

However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
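As a small illustration, the following sketch uses only basic characters in the source, so it reads the same whether the file is treated as CP1252 or UTF-8. The variable names are my own. The prefixed literals have encodings fixed by the standard, so their values can be checked at compile time; the narrow literal depends on the execution character set, so nothing is asserted about it.

```cpp
#include <cstddef>

// \u00b0 names DEGREE SIGN no matter how the source file itself is encoded.
constexpr char16_t degree_utf16 = u"\u00b0"[0];  // one UTF-16 code unit
constexpr char32_t degree_utf32 = U"\u00b0"[0];  // one UTF-32 code unit

// A u8 literal is always UTF-8: two bytes (C2 B0) plus the terminating NUL.
constexpr std::size_t degree_utf8_size = sizeof(u8"\u00b0");

// The narrow literal is converted to the execution character set,
// so its bytes depend on the compiler and its options.
const char* degree_sign = "\u00b0";

static_assert(degree_utf16 == 0x00B0, "UTF-16 code unit for the degree sign");
static_assert(degree_utf32 == 0x00B0, "UTF-32 code unit for the degree sign");
static_assert(degree_utf8_size == 3, "two UTF-8 bytes plus NUL");
```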

Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?

It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)

Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)

Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.

Or can i use the "character names" that unicode defines like in the ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

No, you cannot use Unicode long names.

or even in an identifier in the source itself? That would be a treat... cough...

If the compiler supports a source code encoding that contains the extended character you want then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec then you may write any character in its source character set directly in the source without bothering with writing UCNs.
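A quick way to see this equivalence is to declare a name with a UCN and refer to it with the character typed directly. This is only a sketch, assuming the file is saved as UTF-8 and the compiler accepts extended characters written literally in identifiers (recent Clang and GCC releases do; some older GCC releases do not). The function name is hypothetical.

```cpp
// Declared with a universal character name...
constexpr long p\u00F6rk = 42;

// ...and read back using the character typed directly in the source.
// Both spellings designate the same identifier.
constexpr long read_via_literal_spelling() { return pörk; }

static_assert(read_via_literal_spelling() == 42,
              "UCN and literal spellings name the same variable");
```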

Why doesn't these unicode variable names work with -fextended-identifiers? «, » and ≠

The C++ Standard requires (section 2.10):

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in E.2. Upper- and lower-case letters are different. All characters are significant.

And E.1:

Ranges of characters allowed [charname.allowed]

  • 00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

  • 0100-167F, 1681-180D, 180F-1FFF

  • 200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F

  • 2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

  • 3004-3007, 3021-302F, 3031-303F

  • 3040-D7FF

  • F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD

  • 10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
    60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
    B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

And E.2:

Ranges of characters disallowed initially [charname.disallowed]

  • 0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F

Your angle brackets are 0x300A and 0x300B, which are not included. Not equal is 0x2260, also disallowed.

What constitutes a valid C Identifier?

As others have mentioned, Annex D of ISO/IEC 9899:2011 lists the hexadecimal values of characters valid for universal character names in C11. (I won't bother repeating it here.) I have been searching for an answer as to "why" this list was chosen.

Character set standards

First, there are two relevant standards defining a set of characters: ISO/IEC 10646 (defining UCS) and Unicode. To further confuse (or simplify) things, they both define the same characters, since ISO and Unicode keep them synchronized. UCS is essentially just a character map associating values to a set of characters ("repertoire"), while Unicode also gives further definitions such as how to compare strings in an alphabetical sorting order (collation), which code points represent "canonically equivalent" characters (normalization), a bidirectional algorithm for how to process characters in languages written right to left, and more.

Universal character names in C

Universal character names (UCNs) were a feature newly added in C99 (ISO/IEC 9899:1999). In the "Rationale for International Standard---Programming Languages---C" (Rev. 2, Oct. 1999), the purpose was "to enable the use of any 'native' character in identifiers, string literals and character constants, while retaining the portability objective of C" (sec. 5.2.1). This section continues on about issues of how to encode these characters in C (the \U and \u forms versus multibyte characters or native encodings) and policy models of how to deal with it (p.14, see PDF page 22).

Rationale

I was hoping that the same "rationale" document from 1999 would give a reason of why each extended character range was selected as acceptable for C99's UCNs. The entirety of the rationale's Annex I is:

Annex I Universal character names for identifiers (normative)

A new feature of C9X.

This is not much of a rationale. They didn't even know what year the C standard would be published, so it's just called "C9X". A later rationale document from 2003 is slightly more enlightening:

Annex D Universal character names for identifiers (normative)

New feature for C99.

The intention is to keep current with ISO/IEC TR 10176.

ISO/IEC TR 10176 is "Guidelines for the preparation of programming language standards." It is basically a guidebook for people who write programming language standards. It includes guidelines for the use of character sets in programming languages as well as a "recommended extended repertoire for user-defined identifiers" (Annex A). But this quote from the 2003 rationale document is only an "intention to keep current," not a pledge of strict adherence to TR 10176.

There is a publicly available ISO/IEC TR 10176:2003 table of characters. The character values refer to ISO 10646. The table classifies ranges of characters from numerous languages as being "uppercase" Lu; "lowercase" Ll; "number, decimal digit" Nd, "punctuation, connector" Pc; etc. It should be clear what use such classifications have to a programming language.

An important reminder is that TR 10176 is a Technical Report, and not a standard. I have found several passing references to it on forums and in documents related to other programming languages, such as Ada, COBOL, and D language. Much of the discussion was about how closely standards of those languages should follow TR 10176 (not being a standard) and complaints that TR 10176 was lagging behind updates to ISO 10646.

Perhaps most enlightening is document WG21/N3146: "Recommendations for extended identifier characters for C and C++." It starts with a comment in 2010 to the standards body recommending restrictions on the initial characters of identifiers. It mentions similar complaints about C referencing TR 10176, and makes suggestions about what characters should be allowed as initial characters of an identifier based on restrictions from Unicode's Identifier and Pattern Syntax and XML's Common Syntactic Constructs. WG21/N3146 gives the proposed wording that later appeared in the C11 standard ISO/IEC 9899:2011. There is a table at the end of the document that helps shed light on the character ranges selected.

Characters allowed and not allowed in C11

Below is a compiled list of ranges for extended identifier characters. Entries marked [allowed] are the ranges given in C11 (ISO/IEC 9899:2011 Annex D). Entries marked [not allowed] are ranges not listed in C11, with some comments added. They are either marked in WG21/N3146 as disallowed by Unicode's UAX#31 or XML's Common Syntactic Constructs, or prohibited by some other comment.

[allowed] 00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF: (Various characters, such as feminine ª and masculine º ordinal indicators, vowels with diacritics, numeric characters such as superscript numbers, fractions, etc.)

[not allowed] (the gaps between the ranges above): All disallowed by UAX31 and/or XML. (Generally punctuation type marks like «», monetary symbols ¥£, mathematical operators ×÷, etc.)

[allowed] 0100-167F: (Latin, Greek, Cyrillic, Arabic, Thai, Ethiopic, etc.---many others)

[not allowed] 1680: "The Ogham block contains a script-specific space"

[allowed] 1681-180D: (Ogham, Tagalog, Mongolian, etc.)

[not allowed] 180E: "The Mongolian block contains a script-specific space"

[allowed] 180F-1FFF: (More languages... phonetics, extended Latin & Greek, etc.)

[not allowed] 2000: starts the "General Punctuation" block, but some are allowed:

[allowed] 200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F: (selections from "General Punctuation" block)

[allowed] 2070-218F: "Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols, Letterlike Symbols, Number Forms"

[not allowed] 2190-245F: "Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition"

[allowed] 2460-24FF: "Enclosed Alphanumerics"

[not allowed] 2500: starts "Box Drawing, Block Elements, Geometric Shapes", etc.

[allowed] 2776-2793: (some dingbats and circled dingbats)

[not allowed] 2794-2BFF: (a different dingbat set, mathematical symbols, arrows, Braille patterns, etc.)

[allowed] 2C00-2DFF, 2E80-2FFF: "Glagolitic, Latin Extended-C, Coptic, Georgian Supplement, Tifinagh, Ethiopic Extended, Cyrillic Extended-A" (also CJK radical supplement)

[not allowed] 3000: (start of "CJK Symbols and Punctuation", some selections allowed)

[allowed] 3004-3007, 3021-302F, 3031-303F: (allowed "CJK Symbols and Punctuation")

[allowed] 3040-D7FF: "Hiragana, Katakana," more CJK ideograms, radicals, etc.

[not allowed] D800-F8FF: (This starts the High and Low Surrogate Areas (number space needed for encodings), and Private Use)

[allowed] F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD: selections from "CJK Compatibility Ideographs," "Arabic Presentation Forms," etc.

[allowed] 10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD: WG21/N3146 gives the rationale for these final ranges:

The Supplementary Private Use Area extends from F0000 through 10FFFF; both [AltId] and [XML2008] disallow characters in that range.

In addition, [AltId] disallows, as non-characters, the last two code positions of each plane, i.e. every position of the form PFFFE or PFFFF, for any value of P.

The "Ranges of characters disallowed initially" from C11 Annex D.2 are 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F.
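These four ranges are small enough to check directly. A minimal sketch (the function name is hypothetical) of the "disallowed initially" test:

```cpp
// True if code point c is in one of the Annex D.2 ranges, i.e. it may
// appear inside an identifier but not as its first character.
constexpr bool disallowed_initially(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x1DC0 && c <= 0x1DFF) ||
           (c >= 0x20D0 && c <= 0x20FF) || (c >= 0xFE20 && c <= 0xFE2F);
}
```

So a combining acute accent (0301) may follow a letter in an identifier but may not start one, whereas a precomposed é (00E9) is fine in either position.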

With this WG21/N3146 placed next to the Annex D of the C11 standard, much can be inferred about how they line up. For example, mathematical operators and punctuation seem to be not allowed. I hope this sheds some light on "why" or "how" the allowed characters were chosen.



TL;DR version

  • Authoritative source for legal identifier characters is the C11 standard ISO/IEC 9899:2011 (See Annex D).
  • This list is based on a technical report, ISO/IEC TR 10176, but with modifications.

Valid Identifiers containing Special Characters - C Programming

Is someone able to explain why: '_variable$2' is a valid C programming identifier?

It isn't, in the sense that a strictly-conforming C program cannot use identifiers that contain the '$' character.

I thought that only letters, digits and the '_' (underscore) characters were allowed.

Only the underscore, the decimal digits, the unaccented upper- and lowercase letters, and universal character names are required to be allowed in C identifiers (universal character names were added in C99, and C11 revised the set of characters they may designate). However, the standard explicitly permits implementations to define other characters that they accept as well.

However variable names such as '_variable$2' are completely valid and run as normal when compiled and tested.

That one implementation accepts such identifiers does not make them "completely valid". It just makes them valid in that implementation.

And if this is the case, what other special characters can and cannot be used similarly? Is this limited to simply characters or could even emoji's be substituted into valid identifier names within the C programming language?

The standard specifies that the list of additional characters accepted in identifiers is implementation defined. This has a specific meaning in the standard: conforming implementations must document their choices for all implementation defined characteristics. Therefore, if you're willing to rely on the specific characteristics of some chosen implementation, then you should find a list or description of that implementation's allowed extra characters in its documentation.

On the other hand, if you want your program to work unchanged with multiple different C implementations, then you should stick to only letters, digits, and the underscore, and maybe universal character names in identifiers.

And don't be too quick to overlook those universal character names: to the extent that emoji (and many other characters) are encoded by Unicode, you can use UCNs to include them in your identifiers, at least in a logical sense, provided that you are content to rely on C11.

Is there any problem to use characters `$` or `@` in C++ code?

It is not a good idea by any means, but if you come from the world of JavaScript (e.g. the famous jQuery's $) or other languages and like that, and you want to be fancy, you can indeed use a lot of things!

For instance, $ works as an extension in many compilers:

int $() {
    return 42;
}

You can also use other Unicode characters:

int ᚁᚂᚃ() {
    return 42;
}

And, you can even use emoji:

int 😃() {
    return 42;
}

See e.g. Does C++11 allow dollar signs in identifiers? for more formal details.

Also, note that under MSVC you will probably want /utf-8 and /permissive- if you want to play with this.


