What are connecting characters in Java identifiers?
Here is a list of connecting characters. These are characters used to connect words.
http://www.fileformat.info/info/unicode/category/Pc/list.htm
U+005F _ LOW LINE
U+203F ‿ UNDERTIE
U+2040 ⁀ CHARACTER TIE
U+2054 ⁔ INVERTED UNDERTIE
U+FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ﹍ DASHED LOW LINE
U+FE4E ﹎ CENTRELINE LOW LINE
U+FE4F ﹏ WAVY LOW LINE
U+FF3F ＿ FULLWIDTH LOW LINE
This compiles on Java 7. (From Java 9 onward, a lone _ is a keyword and can no longer be used as an identifier.)
int _, ‿, ⁀, ⁔, ︳, ︴, ﹍, ﹎, ﹏, ＿;
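The list above can be reproduced from the JDK itself. The following is a minimal sketch, assuming that "connecting character" means Unicode general category Pc, which Java exposes as Character.CONNECTOR_PUNCTUATION. The class and method names are made up for illustration, and the exact set printed depends on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;

class ConnectorPunctuation {
    /** Collects every code point whose general category is Pc (connector punctuation). */
    static List<Integer> connectors() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.CONNECTOR_PUNCTUATION) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : connectors()) {
            // Print code point, the character itself, and its Unicode name.
            System.out.printf("U+%04X %s %s%n",
                    cp, new String(Character.toChars(cp)), Character.getName(cp));
        }
    }
}
```

Every code point returned this way also satisfies Character.isJavaIdentifierStart, which is why each of them can stand alone as a variable name.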
An example. In this case tp is the name of a column and the value for a given row.
Column<Double> ︴tp︴ = table.getColumn("tp", double.class);
double tp = row.getDouble(︴tp︴);
The following
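for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    if (Character.isJavaIdentifierStart(i) && !Character.isAlphabetic(i)) {
        // toChars handles code points above U+FFFF, which a (char) cast would truncate
        System.out.print(new String(Character.toChars(i)) + " ");
    }
}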
prints
$ _ ¢ £ ¤ ¥ ؋ ৲ ৳ ৻ ૱ ௹ ฿ ៛ ‿ ⁀ ⁔ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ꠸ ﷼ ︳ ︴ ﹍ ﹎ ﹏ ﹩ ＄ ＿ ￠ ￡ ￥ ￦
How to compile Java using Unicode characters in identifiers
No, you can't.
An identifier has to start with a so-called Java letter, that is
[...] a character for which the method Character.isJavaIdentifierStart(int) returns true.
Which in turn means
A character [ch] may start a Java identifier if and only if one of the following conditions is true:
- isLetter(ch) returns true
- getType(ch) returns LETTER_NUMBER
- ch is a currency symbol (such as '$')
- ch is a connecting punctuation character (such as '_').
The (optional) subsequent characters must each be a Java letter-or-digit, that is
[...] a character for which the method Character.isJavaIdentifierPart(int) returns true.
Which in turn means
A character may be part of a Java identifier if any of the following conditions are true:
- it is a letter
- it is a currency symbol (such as '$')
- it is a connecting punctuation character (such as '_')
- it is a digit
- it is a numeric letter (such as a Roman numeral character)
- it is a combining mark
- it is a non-spacing mark
- isIdentifierIgnorable returns true for the character
None of the above is true for either of the emoji in the question, but it is for сделайЧтонибудь, which is, in fact, a valid identifier.
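The two rules quoted above can be checked directly. Here is a minimal sketch of such a check; the class and method names are hypothetical, and U+1F600 is just an arbitrary emoji picked for illustration. It deliberately ignores the extra rule that keywords and the literals true, false and null are not identifiers.

```java
class IdentifierCheck {
    /** True if s starts with a Java letter and continues with Java letters-or-digits. */
    static boolean isValidIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so characters above U+FFFF are treated as one unit.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("сделайЧтонибудь")); // Cyrillic letters
        System.out.println(isValidIdentifier("\uD83D\uDE00"));     // U+1F600, an emoji
        System.out.println(isValidIdentifier("1abc"));             // digit first
    }
}
```

The emoji fails because its general category is "other symbol", which is not in either list.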
What you could do (though why bother) is write a preprocessor that translates those emoji into sequences of Java letters. Its output would be a Java program with valid identifiers, which you can then feed to the compiler.
Why does Java allow control characters in its identifiers?
The Java Language Specification, section 3.8, defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, accepts any character for which Character.isIdentifierIgnorable() returns true, which allows non-whitespace control characters (including the whole C1 range; see the method's documentation for the list).
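To see which control characters this covers, you can simply count them. A small sketch (the class and helper names are made up for this example):

```java
class IgnorableControls {
    /** Counts code points in [from, to] for which isIdentifierIgnorable is true. */
    static int countIgnorable(int from, int to) {
        int count = 0;
        for (int cp = from; cp <= to; cp++) {
            if (Character.isIdentifierIgnorable(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // C0 controls: the whitespace ones (tab, line feed, etc.) are excluded.
        System.out.println("C0 (U+0000..U+001F): " + countIgnorable(0x00, 0x1F));
        // C1 controls: all of them are ignorable.
        System.out.println("C1 (U+0080..U+009F): " + countIgnorable(0x80, 0x9F));
        System.out.println("tab ignorable? " + Character.isIdentifierIgnorable('\t'));
    }
}
```

According to the Character documentation, the ignorable ISO control characters are U+0000 to U+0008, U+000E to U+001B and U+007F to U+009F, so the whole C1 block is accepted as part of an identifier.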
What's an ignorable character in a Java identifier
There is an open issue for this contradiction.
In summary, these characters are indeed ignored by the compiler when it matches identifier names, but the JLS doesn't mention this. Instead, the JLS says:
Two identifiers are the same only if they are identical, that is, have
the same Unicode character for each letter or digit.
Also
A "Java letter-or-digit" is a character for which the method
Character.isJavaIdentifierPart(int) returns true
The contradiction is obvious as:
Character.isJavaIdentifierPart('\u0001') -> true, so used to compare identifier names
Character.isIdentifierIgnorable('\u0001') -> true, should be ignored actually
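To make the two readings concrete, here is a rough sketch. It compares names once literally, as the JLS wording suggests, and once after dropping ignorable characters. The stripping helper is my own illustration of what "ignored" could mean, not a description of how javac is implemented.

```java
class IgnorableStripper {
    /** Returns the identifier with all ignorable code points removed. */
    static String stripIgnorable(String identifier) {
        StringBuilder sb = new StringBuilder();
        identifier.codePoints()
                .filter(cp -> !Character.isIdentifierIgnorable(cp))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /** Compares two names the way an "ignorables are ignored" rule would. */
    static boolean sameName(String a, String b) {
        return stripIgnorable(a).equals(stripIgnorable(b));
    }

    public static void main(String[] args) {
        String a = "foo\u0001bar";
        String b = "foobar";
        System.out.println("literal comparison:    " + a.equals(b));
        System.out.println("ignoring ignorables:   " + sameName(a, b));
    }
}
```

Under the literal rule the two names differ; under the stripping rule they are the same name, which is exactly the mismatch described above.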
I speculate that IntelliJ IDEA either follows the JLS or is simply unaware of ignorable characters. I don't see a bug report for this here.
As to the purpose of these ignorables: Unicode specifies some Layout and Format Control Characters. It is suggested that these characters should be ignored in identifier names as
the effects they represent are stylistic or otherwise out of scope for
identifiers, and second because the characters themselves often have
no visible display
Apparently the purpose of isIdentifierIgnorable is to identify characters of this category. For instance, the isIdentifierIgnorable documentation mentions that it returns true for characters that have the FORMAT general category value, that is, characters with the Unicode General_Category value Cf, which are included in the Layout and Format Control Characters.
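A quick way to check this for a few well-known invisible characters is sketched below. The helper name is hypothetical, and the sample code points (soft hyphen, zero width space, zero width joiner) are my own picks.

```java
class FormatCharacters {
    /** True if cp has general category Cf and is ignorable in identifiers. */
    static boolean isIgnorableFormat(int cp) {
        return Character.getType(cp) == Character.FORMAT
                && Character.isIdentifierIgnorable(cp);
    }

    public static void main(String[] args) {
        int[] samples = {0x00AD, 0x200B, 0x200D, 'a', 0x0001};
        for (int cp : samples) {
            System.out.printf("U+%04X %-22s ignorable format: %b%n",
                    cp, Character.getName(cp), isIgnorableFormat(cp));
        }
    }
}
```

Note that U+0001 is ignorable too, but as a control character (category Cc), not as a format character.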
Why can't '#', '.', ':' be used in identifiers?
"Should not" is not an exact definition. Better to use "must not" or "cannot".
Once the question is rephrased, we can answer it. The reason is that this is how the Java programming language is defined. So you can ask, "Why did Java's creators define such rules?"
There can be several answers. One of the most relevant (IMHO) is that all programming languages (at least those that I know) have a more or less identical definition of which characters can be used in identifiers:
Letters, digits, or underscore, starting with a letter or underscore.
By the way, Java extends this rule: it permits any letter, including those of national alphabets, while other (at least older) programming languages typically restrict identifiers to Latin letters only.
Among all the characters you listed, I think only # could theoretically be added to the list of characters permitted in identifiers, but they decided not to, probably with future releases of Java in mind, where this character might become part of the language.
I think that using , and ; in an identifier is obviously impossible. Think about the for statement.
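If you just want to confirm what the language accepts, you can ask Character directly. A small sketch with a made-up class name:

```java
class PunctuationCheck {
    /** True if c may appear in a Java identifier after the first character. */
    static boolean allowedInIdentifier(char c) {
        return Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        for (char c : "#.:,;$_".toCharArray()) {
            System.out.println("'" + c + "' allowed in identifier: " + allowedInIdentifier(c));
        }
    }
}
```

Only $ (a currency symbol) and _ (connecting punctuation) pass; the others belong to the "other punctuation" category, which is in none of the permitted groups.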
Java identifiers
If isJavaIdentifierStart returns true for it, then by definition it's a valid Java identifier starting character, because that's how the specification defines it:
Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter
IdentifierChars JavaLetterOrDigit
JavaLetter:
any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
any Unicode character that is a Java letter-or-digit (see below)
...
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.
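The "but not a Keyword or BooleanLiteral or NullLiteral" part of the grammar is easy to forget. As a sketch, the standard javax.lang.model.SourceVersion class can be combined to apply both halves of the rule; the wrapper class and method name here are my own, and both library calls answer for the latest source version the running JDK knows about.

```java
import javax.lang.model.SourceVersion;

class IdentifierGrammar {
    /**
     * Approximates the Identifier production: valid identifier characters,
     * but not a keyword, boolean literal, or null literal.
     */
    static boolean isIdentifier(String s) {
        return SourceVersion.isIdentifier(s) && !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo", "class", "true", "null", "1foo"}) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

SourceVersion.isIdentifier on its own accepts keywords and the literals, since it only looks at the characters; the isKeyword check excludes them.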
My grammar identifies keywords as identifiers
Your grammar is a lexer grammar, meaning it produces only tokens. Learn the difference between lexer, parser and combined grammars here: https://github.com/antlr/antlr4/blob/master/doc/grammars.md
In short, remove the word lexer from your grammar and change some rules into parser rules (these start with a lowercase letter):
grammar Mini;
program: 'program' Identifier body EOF;
body: ('declare' decl_list) 'begin' stmt_list 'end';
decl_list: decl ';' (decl ';')?;
decl: type ident_list;
ident_list: (Identifier ','?)*;
type: 'integer' | 'decimal';
stmt_list: stmt ';' (stmt ';')*;
stmt: assign_stmt | if_stmt | while_stmt| read_stmt | write_stmt | for_stmt;
assign_stmt: Identifier ':=' simple_expr;
if_stmt: 'if' condition 'then' stmt_list 'end' | 'if' condition 'then' stmt_list 'else' stmt_list 'end';
condition: expression;
for_stmt: 'for' assign_stmt 'to' condition 'do' stmt_list 'end';
while_stmt: 'while' condition 'do' stmt_list 'end';
read_stmt: 'read' '(' Identifier ')';
write_stmt: 'write' '(' writable ')';
writable: simple_expr | Literal;
expression: simple_expr | simple_expr Relop simple_expr;
simple_expr: term | term Addop term| '(' term ')' ? term ':' term;
term: factor_a | factor_a Mulop factor_a;
factor_a: factor | 'not' factor | '-' factor;
factor: Identifier | Constant | '(' expression ')';
Relop: '=' | '>' | '>=' | '<' | '<=' | '<>';
Addop: '+' | '-' | 'or';
Mulop: '*' | '/' | 'mod' | 'and';
Shiftop: '<<' | '>>' | '<<<' | '>>>';
COMENTARIO: '%' ~('\n'|'\r')* '\r'? '\n' -> skip;
Constant: ('0'..'9') (('0'..'9'))*;
Literal: '"' ('\u0000'..'\uFFFE')* '"';
Identifier: ('a'..'z'|'A'..'Z') (('a'..'z'|'A'..'Z') | ('0'..'9'))*;
Space: [ \t\r\n] -> skip;
Note that {skip();} is old v3 syntax; use -> skip instead.
And Constant: ('0'..'9') (('0'..'9'))*; is also old v3 syntax (although still valid in v4). The preferred way to write it is:
Constant: [0-9] (([0-9]))*;
which can simply be written as:
Constant: [0-9]+;
Java Unicode variable names
The Unicode standard defines what counts as a letter.
From the Java Language Specification, section 3.8:
Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
From the Character documentation for isJavaIdentifierPart:
Determines if the character (Unicode code point) may be part of a Java identifier as other
than the first character.
A character may be part of a Java identifier if any of the following are true:
- it is a letter
- it is a currency symbol (such as '$')
- it is a connecting punctuation character (such as '_')
- it is a digit
- it is a numeric letter (such as a Roman numeral character)
- it is a combining mark
- it is a non-spacing mark
- isIdentifierIgnorable(codePoint) returns true for the character
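Several items on this list (digits, combining marks, ignorable characters) may appear inside an identifier but may not start one. A small sketch that sorts code points into those buckets; the class name, the labels and the sample code points are my own choices:

```java
class PartVersusStart {
    /** Classifies a code point by where it may appear in a Java identifier. */
    static String classify(int cp) {
        if (Character.isJavaIdentifierStart(cp)) {
            return "start";      // may appear anywhere, including first
        }
        if (Character.isJavaIdentifierPart(cp)) {
            return "part only";  // allowed, but not as the first character
        }
        return "neither";
    }

    public static void main(String[] args) {
        int[] samples = {'a', '1', 0x0301, 0x2160, 0x0001, ' '};
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, classify(cp));
        }
    }
}
```

Here U+0301 is a combining acute accent and U+2160 is the Roman numeral one; the latter counts as a numeric letter and so can even start an identifier.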