Why Is Executing Java Code in Comments with Certain Unicode Characters Allowed

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3, this allows any ASCII-based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and it is especially important in comments when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.

There are many gotchas on this theme and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?

\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.


Source: Java: Executing code in comments?!

Is executing C++ code in comments with certain Unicode characters allowed, like in Java?

If I read this translation phase reference correctly, then the sequence

// \u000d some code here

is mapped in phase 1 to itself, i.e. the parser does not translate or expand \u000d. Instead the translation of such sequences happens in phase 5, which is after the comments are replaced by a space in phase 3.

So to answer your question: C++ does not "execute" (or parse) code in comments, not even with a Unicode newline in it.

How unicode characters are read in java comments

Before compilation, every Unicode escape is replaced by the character it represents, and since \u000a represents a newline, the code

// String st4="MN3444\u000ar4t4";

is the same as this code (notice that the text after \u000a is moved to a new line, which means it is no longer part of the comment):

// String st4="MN3444
r4t4";

You can test it with

//\u000a;System.out.println("hello comment");

which is equal to

//
System.out.println("hello comment");

and will print hello comment as its output.
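
For a self-contained variant you can compile and run, here is a minimal sketch (the class and method names are my own, not from the original answer):

```java
public class UnicodeEscapeDemo {
    // Returns true only if the statement hidden inside the "comment" runs.
    static boolean runTrick() {
        boolean executed = false;
        // The next line looks like one single-line comment, but the escape
        // below is translated to a real newline BEFORE the lexer sees it,
        // so everything after it is live code:
        //\u000aexecuted = true;
        return executed;
    }

    public static void main(String[] args) {
        System.out.println("executed = " + runTrick()); // prints: executed = true
    }
}
```

Note that the explanatory comments above deliberately avoid spelling out a backslash-u escape, because the compiler would translate it even inside those comments.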

How compile java using unicode characters in identifiers

No, you can't.

An identifier has to start with a so-called Java letter that is

[...] a character for which the method Character.isJavaIdentifierStart(int) returns true.

Which in turn means

A character [ch] may start a Java identifier if and only if one of the following conditions is true:

  • isLetter(ch) returns true
  • getType(ch) returns LETTER_NUMBER
  • ch is a currency symbol (such as '$')
  • ch is a connecting punctuation character (such as '_').

The (optional) subsequent characters must be a Java letter-or-digit, that is

[...] a character for which the method Character.isJavaIdentifierPart(int) returns true.

Which in turn means

A character may be part of a Java identifier if any of the following conditions are true:

  • it is a letter
  • it is a currency symbol (such as '$')
  • it is a connecting punctuation character (such as '_')
  • it is a digit
  • it is a numeric letter (such as a Roman numeral character)
  • it is a combining mark
  • it is a non-spacing mark
  • isIdentifierIgnorable returns true for the character

None of the above holds for emoji characters, but it does for сделайЧтонибудь, which is, in fact, a valid identifier.
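
The rules above can be probed directly with the java.lang.Character API; this small sketch (the class name and sample code points are my own choices) confirms that an emoji fails both checks while a Cyrillic letter passes:

```java
public class IdentifierCheck {
    public static void main(String[] args) {
        int emoji = 0x1F600; // GRINNING FACE: a symbol, not a letter
        // An emoji can neither start nor continue an identifier:
        System.out.println(Character.isJavaIdentifierStart(emoji)); // false
        System.out.println(Character.isJavaIdentifierPart(emoji));  // false
        // ...but a Cyrillic letter, as in сделайЧтонибудь, can start one:
        System.out.println(Character.isJavaIdentifierStart('с'));   // true
        // Currency symbols and connecting punctuation are also allowed:
        System.out.println(Character.isJavaIdentifierStart('$'));   // true
        System.out.println(Character.isJavaIdentifierStart('_'));   // true
    }
}
```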


What you could do (why bother, though) is write a pre-processor that translates those emojis into sequences of Java letters, its output being a Java program with valid identifiers that you can finally feed to the compiler.

Unicode in javadoc and comments?

Some compilers failed on non-ASCII characters in JavaDoc and source code comments.

This is likely because the compiler assumes that the input is UTF-8, and there are invalid UTF-8 sequences in the source file. That these appear to be in comments in your source code editor is irrelevant because the lexer (which distinguishes comments from other tokens) never gets to run. The failure occurs while the tool is trying to convert bytes into chars before the lexer runs.


The man pages for javac and javadoc say

-encoding name
Specifies the source file encoding name, such as
EUCJIS/SJIS. If this option is not specified, the plat-
form default converter is used.

so running javadoc with the encoding flag

javadoc -encoding <encoding-name> ...

after replacing <encoding-name> with the encoding you've used for your source files should cause it to use the right encoding.

If you've got more than one encoding used within a group of source files that you need to compile together, you need to fix that first and settle on a single uniform encoding for all source files. You should really just use UTF-8 or stick to ASCII.
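
To illustrate the underlying failure, this sketch (the class name and helper method are my own) decodes a Latin-1-encoded 'é' as strict UTF-8, the same byte/charset mismatch that makes javac or javadoc fail before the lexer ever runs:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    // Returns true if the given bytes are NOT a valid UTF-8 sequence.
    static boolean isMalformedUtf8(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 'é' in Latin-1 is the single byte 0xE9, which UTF-8 reads as a
        // truncated multi-byte sequence:
        System.out.println(isMalformedUtf8(new byte[]{ (byte) 0xE9 }));              // true
        // The same character properly encoded as UTF-8 (0xC3 0xA9) decodes fine:
        System.out.println(isMalformedUtf8(new byte[]{ (byte) 0xC3, (byte) 0xA9 })); // false
    }
}
```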


What is the current (Java 7) and future (Java 8 and beyond) practices with respect to Unicode in Java source files?

The algorithm for dealing with a source file in Java is

  1. Collect bytes
  2. Convert bytes to chars (UTF-16 code units) using some encoding.
  3. Replace all sequences of '\\' 'u' followed by four hex digits with the code-unit corresponding to those hex-digits. Error out if there is a "\u" not followed by four hex digits.
  4. Lex the chars into tokens.
  5. Parse the tokens into classes.

The current and former practice is that step 2, converting bytes to UTF-16 code units, is up to the tool that is loading the compilation unit (source file) but the de facto standard for command line interfaces is to use the -encoding flag.

After that conversion happens, the language mandates that \uABCD style sequences are converted to UTF-16 code units (step 3) before lexing and parsing.

For example:

int a;
\u0061 = 42;

is a valid pair of Java statements.
Any Java source code tool must, after converting bytes to chars but before parsing, look for \uABCD sequences and convert them, so this code is converted to

int a;
a = 42;

before parsing. This happens regardless of where the \uABCD sequence occurs.

This process looks something like

  1. Get bytes: [105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49, 32, 61, 32, 52, 50, 59]
  2. Convert bytes to chars: ['i', 'n', 't', ' ', 'a', ';', '\n', '\\', 'u', '0', '0', '6', '1', ' ', '=', ' ', '4', '2', ';']
  3. Replace unicode escapes: ['i', 'n', 't', ' ', 'a', ';', '\n', 'a', ' ', '=', ' ', '4', '2', ';']
  4. Lex: ["int", "a", ";", "a", "=", "42", ";"]
  5. Parse: (Block (Variable (Type int) (Identifier "a")) (Assign (Reference "a") (Int 42)))
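
Step 3 of the pipeline can be sketched as a small translator (a simplified illustration of JLS 3.3: it ignores the multiple-'u' form and the backslash-parity rule, so it is not a complete implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeTranslator {
    // Matches a backslash, 'u', and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String translate(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Convert the four hex digits to the UTF-16 code unit they denote.
            char codeUnit = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(codeUnit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("int a;\n\\u0061 = 42;"));
        // prints:
        // int a;
        // a = 42;
    }
}
```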

Should all non-ASCII characters be escaped in JavaDoc with HTML &escape;-like codes?

No need, except for HTML special characters like '<' that you want to appear literally in the documentation. You can use \uABCD sequences inside javadoc comments.
Java processes \u.... escapes before parsing the source file, so they can appear inside strings, comments, anywhere really. That's why

System.out.println("Hello, world!\u0022);

is a valid Java statement.

/** @return \u03b8 in radians */

is equivalent to

/** @return θ in radians */

as far as javadoc is concerned.


But what would be the Java // comment equivalent?

You can use // comments in Java, but Javadoc only looks inside /**...*/ comments for documentation; // comments do not carry documentation metadata.

One ramification of Java's handling of \uABCD sequences is that although

// Comment text.\u000A System.out.println("Not really comment text");

looks like a single line comment, and many IDEs will highlight it as such, it is not.

Java doesn't recognize Unicode character in path on Windows 11

Using your directory name (Otávio Augusto Silva), I can reproduce your problem on Windows 10 as well, using Java 18. Unfortunately, this looks like a specific example of a more general and longstanding problem documented in this open and unresolved JDK bug:

JDK-4488646 Java executable and System properties need to support Unicode on Windows

This is part of the bug report's description, with my emphasis added:

To make Java completely Unicode-aware on NT we need to

  1. Modify System properties initialization code and all other places
    where Windows calls are used to use wide-char calls on NT.

  2. Modify java, javac etc. to be able to use Unicode in classpath and
    other command line arguments.

That bug report was created in 2001! It relates to Windows NT, but since it remains open and unresolved I assume it has general applicability for all flavors of Windows, including Windows 10 and 11.

Notes:

  • Although it doesn't help to resolve your specific problem, it is fairly straightforward "to use wide-char calls" within your Java application (as mentioned in the bug description above) using JNA. For example, your code could successfully process Otávio Augusto Silva if it was passed to your application as a command line argument. See this SO answer for the code to do that.

  • Also see open and unresolved JDK bug report JDK-8124977 cmdline encoding challenges on Windows which was raised in 2015. It includes some discussion on the differences between using java from cmd and PowerShell windows on Windows.

========================================================

(This update is based on comments from @user16320675.)

It seems the issue is fully resolved in Java 19, which is due to be released later this month. From the screenshot below:

  • The call to javac will succeed when using JDK 19.

  • The same call to javac will fail when using JDK 18, because the accented character in the file name D:\Otávio... is garbled when the name is processed.

    [Screenshot: javac calls with JDK 18 and JDK 19]

I can't find any mention of this fix in the JDK 19 Release Notes.

========================================================

(This update shows what happens if the beta option is not enabled.)

If the option Beta: Use Unicode UTF-8 for worldwide language support is not enabled I cannot reproduce the problem; the call to javac works fine using both JDK 18 and JDK 19:

[Screenshot: beta option not enabled]

Note that this works even though the code page in the cmd window is 437, not 65001. Of course there are a couple of significant differences between your environment and mine:

  • You are using Windows 11 and I am using Windows 10.
  • My system locale is English (United States), and I assume that yours is different.

To summarize how to resolve this issue:

  • Unless you have that beta option enabled for some specific reason, consider just disabling it.
  • If you want to keep the option enabled, consider upgrading to Java 19.

========================================================

Update: The following bug was fixed in Java 19:

8272352: Java launcher can not parse Chinese character when system locale is set to UTF-8 #530

Although that bug fix specifically relates to file names passed to java, I think it probably explains why the OP's problem with javac is also resolved in Java 19.


