What Are "Connecting Characters" in Java Identifiers

What are connecting characters in Java identifiers?

Here is a list of connecting characters. These are characters used to connect words.

http://www.fileformat.info/info/unicode/category/Pc/list.htm

U+005F _ LOW LINE
U+203F ‿ UNDERTIE
U+2040 ⁀ CHARACTER TIE
U+2054 ⁔ INVERTED UNDERTIE
U+FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ﹍ DASHED LOW LINE
U+FE4E ﹎ CENTRELINE LOW LINE
U+FE4F ﹏ WAVY LOW LINE
U+FF3F ＿ FULLWIDTH LOW LINE
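The list above corresponds to the Unicode general category Pc, which Java exposes as Character.CONNECTOR_PUNCTUATION. A minimal sketch of how the list could be reproduced on your own JDK follows; the class name ConnectorList is made up for illustration, and the exact result depends on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is not from the original post.
class ConnectorList {

    // Collects every code point whose general category is Pc (connector punctuation).
    static List<Integer> connectors() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.CONNECTOR_PUNCTUATION) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : connectors()) {
            // Print the code point, the character itself and its Unicode name.
            System.out.printf("U+%04X %s %s%n",
                    cp, new String(Character.toChars(cp)), Character.getName(cp));
        }
    }
}
```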

This compiles on Java 7. (As of Java 9 a lone _ is a keyword, so newer compilers reject the first name.)

int _, ‿, ⁀, ⁔, ︳, ︴, ﹍, ﹎, ﹏, ＿;

An example. In this case tp is the name of a column and the value for a given row.

Column<Double> ︴tp︴ = table.getColumn("tp", double.class);

double tp = row.getDouble(︴tp︴);

The following

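for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    // Keep characters that may start an identifier but are not alphabetic.
    if (Character.isJavaIdentifierStart(i) && !Character.isAlphabetic(i))
        // toChars handles code points above U+FFFF, which a (char) cast would truncate.
        System.out.print(new String(Character.toChars(i)) + " ");
}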

prints

$ _ ¢ £ ¤ ¥ ؋ ৲ ৳ ৻ ૱ ௹ ฿ ៛ ‿ ⁀ ⁔ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ꠸ ﷼ ︳ ︴ ﹍ ﹎ ﹏ ﹩ ＄ ＿ ￠ ￡ ￥ ￦

How to compile Java using Unicode characters in identifiers

No, you can't.

An identifier has to start with a so-called Java letter that is

[...] a character for which the method Character.isJavaIdentifierStart(int) returns true.

Which in turn means

A character [ch] may start a Java identifier if and only if one of the following conditions is true:

  • isLetter(ch) returns true
  • getType(ch) returns LETTER_NUMBER
  • ch is a currency symbol (such as '$')
  • ch is a connecting punctuation character (such as '_').
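The four documented conditions can be restated directly in code. The following is only a sketch: the class StartRule and the method mayStart are made-up names, and the real Character.isJavaIdentifierStart is implemented differently inside the JDK. The main method prints both answers side by side for a few sample characters so they can be compared.

```java
// Hypothetical class; it restates the documented conditions, nothing more.
class StartRule {

    static boolean mayStart(int cp) {
        int type = Character.getType(cp);
        return Character.isLetter(cp)                     // isLetter(ch) returns true
                || type == Character.LETTER_NUMBER        // for example Roman numeral characters
                || type == Character.CURRENCY_SYMBOL      // such as '$'
                || type == Character.CONNECTOR_PUNCTUATION; // such as '_'
    }

    public static void main(String[] args) {
        // 0x2167 is ROMAN NUMERAL EIGHT, a LETTER_NUMBER character.
        int[] samples = {'a', '$', '_', 0x2167, '1', '#'};
        for (int cp : samples) {
            System.out.printf("U+%04X documented=%b library=%b%n",
                    cp, mayStart(cp), Character.isJavaIdentifierStart(cp));
        }
    }
}
```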

The (optional) subsequent characters must be a Java letter-or-digit, that is

[...] a character for which the method Character.isJavaIdentifierPart(int) returns true.

Which in turn means

A character may be part of a Java identifier if any of the following conditions are true:

  • it is a letter
  • it is a currency symbol (such as '$')
  • it is a connecting punctuation character (such as '_')
  • it is a digit
  • it is a numeric letter (such as a Roman numeral character)
  • it is a combining mark
  • it is a non-spacing mark
  • isIdentifierIgnorable returns true for the character

None of the above is true for either of the emoji in your question, but it is for сделайЧтонибудь, which is in fact a valid identifier.
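If you want to check a candidate name yourself, a small sketch along these lines may help. The class IdentifierCheck is a made-up name, the emoji code point is an arbitrary one chosen for illustration, and the method looks only at the character rules quoted above (it does not exclude keywords).

```java
// Hypothetical helper; checks only the character rules, not the keyword exclusion.
class IdentifierCheck {

    static boolean hasValidIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so that supplementary characters (such as emoji) are seen whole.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written as escapes so the source stays ASCII.
        String cyrillic = "\u0441\u0434\u0435\u043B\u0430\u0439";
        // An arbitrary emoji (code point 1F34E), assumed here for illustration.
        String emoji = new String(Character.toChars(0x1F34E));
        System.out.println(hasValidIdentifierChars(cyrillic));
        System.out.println(hasValidIdentifierChars(emoji));
    }
}
```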


What you could do (why bother, though) is write a pre-processor that translates those emoji into sequences of Java letters. Its output would be a Java program with valid identifiers, which you can then feed to the compiler.

Why does Java allow control characters in its identifiers?

The Java Language Specification, section 3.8, defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter accepts, among other things, any character for which Character.isIdentifierIgnorable() returns true, and that includes the non-whitespace control characters (including the whole C1 range; see the method's documentation for the list).
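To see this on your own JDK, a short sketch like the following prints how a few control and format characters are classified. The class name IgnorableDemo and the choice of sample characters are mine, not from the specification.

```java
// Hypothetical demo class; it only reports what the Character methods return.
class IgnorableDemo {

    static String describe(int cp) {
        return String.format("U+%04X part=%b ignorable=%b whitespace=%b",
                cp,
                Character.isJavaIdentifierPart(cp),
                Character.isIdentifierIgnorable(cp),
                Character.isWhitespace(cp));
    }

    public static void main(String[] args) {
        // U+0001 (C0 control), U+0009 (tab), U+0085 (C1 control), U+00AD (soft hyphen, category Cf)
        int[] samples = {0x0001, 0x0009, 0x0085, 0x00AD};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```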

What's an ignorable character in a Java identifier

There is an open issue for this contradiction.

In summary, the compiler does ignore these characters when matching identifier names, but the JLS doesn't mention this. Instead, the JLS says:

Two identifiers are the same only if they are identical, that is, have
the same Unicode character for each letter or digit.

Also

A "Java letter-or-digit" is a character for which the method
Character.isJavaIdentifierPart(int) returns true

The contradiction is obvious as:

Character.isJavaIdentifierPart('\u0001')  -> true, so used to compare identifier names
Character.isIdentifierIgnorable('\u0001') -> true, should be ignored actually
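To make the difference concrete, here is a rough sketch of the two ways of comparing names: character by character, as the JLS wording suggests, and after dropping ignorable characters, as the compiler is reported to do. The class IgnorableStripper and its methods are hypothetical names and are not part of the JDK or javac.

```java
// Hypothetical helper; it models the comparison, it is not how javac is written.
class IgnorableStripper {

    // Removes every code point for which Character.isIdentifierIgnorable returns true.
    static String stripIgnorable(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        name.codePoints()
            .filter(cp -> !Character.isIdentifierIgnorable(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Compares two names the way a compiler that ignores such characters would.
    static boolean sameAfterStripping(String a, String b) {
        return stripIgnorable(a).equals(stripIgnorable(b));
    }

    public static void main(String[] args) {
        String plain = "count";
        // The same letters with the control character U+0001 in the middle.
        String withControl = "co" + (char) 0x0001 + "unt";
        System.out.println("identical strings:     " + plain.equals(withControl));
        System.out.println("same after stripping:  " + sameAfterStripping(plain, withControl));
    }
}
```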

I speculate that IntelliJ IDEA either follows the JLS or is simply unaware of ignorable characters. I don't see a bug report for this here.

As for the purpose of these ignorable characters: Unicode specifies some Layout and Format Control Characters, and it suggests that these characters be ignored in identifier names because

the effects they represent are stylistic or otherwise out of scope for
identifiers, and second because the characters themselves often have
no visible display

Apparently the purpose of isIdentifierIgnorable is to identify characters of this kind. For instance, its documentation says that it returns true for characters with the FORMAT general category value, that is, characters whose Unicode General_Category is Cf. These are among the Layout and Format Control Characters.

Why can't '#', '.', ':' be used in identifiers?

"Should not" is not an exact definition. Better to use "must not" or "cannot".

Once the question is rephrased we can answer it. The reason is that this is how the Java programming language is defined. So you can ask, "why did the Java creators define such rules?"

There can be several answers. One of the most relevant (IMHO) is that all programming languages (at least those that I know) have a more or less identical definition of which characters can be used in identifiers:

Letters, digits or underscore, starting with a letter or underscore.

By the way, Java extends this rule because it permits any letter, including those of national alphabets, while other (at least older) programming languages typically restrict identifiers to Latin letters only.

Of all the characters you listed, I think only # could theoretically be added to the list of characters permitted in identifiers, but the designers decided not to do so, probably with future releases of Java in mind, where this character might become part of the language.

I think that using , and ; in an identifier is obviously impossible. Think about the for statement.
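Whatever the design reasons were, the outcome is easy to confirm. In the sketch below (the class name PunctuationCheck is made up), the punctuation characters discussed here are reported as not allowed, and Character.getType places them in the "other punctuation" category rather than in any category that identifiers accept.

```java
// Hypothetical demo class; it only asks the JDK how each character is classified.
class PunctuationCheck {

    // True when the character may appear after the first position of an identifier.
    static boolean allowedInIdentifier(char c) {
        return Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        for (char c : "#.:,;_$".toCharArray()) {
            System.out.printf("'%c' allowed=%b otherPunctuation=%b%n",
                    c,
                    allowedInIdentifier(c),
                    Character.getType(c) == Character.OTHER_PUNCTUATION);
        }
    }
}
```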

Java identifiers

If isJavaIdentifierStart returns true for it, then by definition, it's a valid Java identifier starting character, because that's how the specification defines it:

Identifier:

      IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral

IdentifierChars:

      JavaLetter

      IdentifierChars JavaLetterOrDigit

JavaLetter:

      any Unicode character that is a Java letter (see below)

JavaLetterOrDigit:

      any Unicode character that is a Java letter-or-digit (see below)

...

A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.
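The "but not a Keyword or BooleanLiteral or NullLiteral" part of the grammar can be checked with the standard javax.lang.model.SourceVersion class, which answers for the latest source version the running JDK supports. The wrapper class NameCheck and its method name below are assumptions made for this sketch.

```java
import javax.lang.model.SourceVersion;

// Hypothetical wrapper around two SourceVersion checks.
class NameCheck {

    // True when the text consists of identifier characters and is not a keyword
    // or one of the literals true, false and null.
    static boolean isUsableIdentifier(String s) {
        return SourceVersion.isIdentifier(s) && !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        String[] samples = {"foo", "while", "true", "null", "1abc"};
        for (String s : samples) {
            System.out.println(s + " -> " + isUsableIdentifier(s));
        }
    }
}
```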

My grammar identifies keywords as identifiers

Your grammar is a lexer grammar, meaning it produces only tokens. Learn the difference between lexer, parser and combined grammars here: https://github.com/antlr/antlr4/blob/master/doc/grammars.md

In short, remove the word lexer from your grammar and change some rules into parser rules (these start with a lower case letter):

grammar Mini;

program: 'program' Identifier body EOF;

body: ('declare' decl_list) 'begin' stmt_list 'end';

decl_list: decl ';' (decl ';')?;

decl: type ident_list;

ident_list: (Identifier ','?)*;

type: 'integer' | 'decimal';

stmt_list: stmt ';' (stmt ';')*;

stmt: assign_stmt | if_stmt | while_stmt| read_stmt | write_stmt | for_stmt;

assign_stmt: Identifier ':=' simple_expr;

if_stmt: 'if' condition 'then' stmt_list 'end' | 'if' condition 'then' stmt_list 'else' stmt_list 'end';

condition: expression;

for_stmt: 'for' assign_stmt 'to' condition 'do' stmt_list 'end';

while_stmt: 'while' condition 'do' stmt_list 'end';

read_stmt: 'read' '(' Identifier ')';

write_stmt: 'write' '(' writable ')';

writable: simple_expr | Literal;

expression: simple_expr | simple_expr Relop simple_expr;

simple_expr: term | term Addop term| '(' term ')' ? term ':' term;

term: factor_a | factor_a Mulop factor_a;

factor_a: factor | 'not' factor | '-' factor;

factor: Identifier | Constant | '(' expression ')';

Relop: '=' | '>' | '>=' | '<' | '<=' | '<>';

Addop: '+' | '-' | 'or';

Mulop: '*' | '/' | 'mod' | 'and';

Shiftop: '<<' | '>>' | '<<<' | '>>>';

COMENTARIO: '%' ~('\n'|'\r')* '\r'? '\n' -> skip;

Constant: ('0'..'9') (('0'..'9'))*;

Literal: '"' ('\u0000'..'\uFFFE')* '"';

Identifier: ('a'..'z'|'A'..'Z') (('a'..'z'|'A'..'Z') | ('0'..'9'))*;

Space: [ \t\r\n] -> skip;

Note that {skip();} is old v3 syntax; use -> skip instead.

And Constant: ('0'..'9') (('0'..'9'))*; is also old v3 syntax (although still valid in v4). The preferred way to do it is like this:

Constant: [0-9] (([0-9]))*;

which can simply be written as:

Constant: [0-9]+;
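The reason keywords stop being reported as Identifier in the combined grammar is precedence: literals such as 'program' that appear in parser rules become tokens of their own and win over Identifier when both match the same text. The Java sketch below only imitates that ordering by hand for a single word. It is not ANTLR-generated code, the class MiniTokens is a made-up name, and the keyword set is a small sample taken from the grammar above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hand-written imitation of the token precedence; not produced by ANTLR.
class MiniTokens {

    // A few of the grammar's keyword literals, for illustration only.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "program", "declare", "begin", "end", "if", "then", "else", "while", "do"));

    // Same shapes as the Identifier and Constant lexer rules.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    static final Pattern CONSTANT = Pattern.compile("[0-9]+");

    static String classify(String word) {
        // Keywords are checked first, so they never fall through to Identifier.
        if (KEYWORDS.contains(word)) {
            return "KEYWORD";
        }
        if (IDENTIFIER.matcher(word).matches()) {
            return "Identifier";
        }
        if (CONSTANT.matcher(word).matches()) {
            return "Constant";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        for (String w : new String[] {"program", "counter1", "42", "#"}) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```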

Java Unicode variable names

The Unicode standard defines what counts as a letter.

From the Java Language Specification, section 3.8:

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.

From the Character documentation for isJavaIdentifierPart:

Determines if the character (Unicode code point) may be part of a Java identifier as other
than the first character.
A character may be part of a Java identifier if any of the following are true:

  • it is a letter
  • it is a currency symbol (such as '$')
  • it is a connecting punctuation character (such as '_')
  • it is a digit
  • it is a numeric letter (such as a Roman numeral character)
  • it is a combining mark
  • it is a non-spacing mark
  • isIdentifierIgnorable(codePoint) returns true for the character
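To find out which of these conditions applies to a particular character, something like the following sketch can be used. The class PartRule and the method reason are made-up names, the mapping from the documentation's wording to Character type constants is my reading of it, and the checks are applied in the order of the list, so a character that satisfies several conditions is reported under the first one.

```java
// Hypothetical helper that names the first documented condition a character meets.
class PartRule {

    static String reason(int cp) {
        int type = Character.getType(cp);
        if (Character.isLetter(cp)) return "letter";
        if (type == Character.CURRENCY_SYMBOL) return "currency symbol";
        if (type == Character.CONNECTOR_PUNCTUATION) return "connecting punctuation";
        if (Character.isDigit(cp)) return "digit";
        if (type == Character.LETTER_NUMBER) return "numeric letter";
        if (type == Character.COMBINING_SPACING_MARK) return "combining mark";
        if (type == Character.NON_SPACING_MARK) return "non-spacing mark";
        if (Character.isIdentifierIgnorable(cp)) return "ignorable";
        return "not allowed";
    }

    public static void main(String[] args) {
        // 0x2167 ROMAN NUMERAL EIGHT, 0x0903 DEVANAGARI SIGN VISARGA,
        // 0x0301 COMBINING ACUTE ACCENT, 0x200D ZERO WIDTH JOINER
        int[] samples = {'a', '$', '_', '7', 0x2167, 0x0903, 0x0301, 0x200D, '#'};
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, reason(cp));
        }
    }
}
```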

