How to Compile a Java Source File Which Is Encoded as "UTF-8"

How to compile a Java source file which is encoded as UTF-8?

Your file is being read as UTF-8; otherwise a character with value 65279 could never appear. Unless told otherwise, javac expects your source code to be in the platform default encoding, according to the javac documentation:

If -encoding is not specified, the platform default converter is used.

Decimal 65279 is hex FEFF, which is the Unicode Byte Order Mark (BOM). It's unnecessary in UTF-8, because UTF-8 is always encoded as an octet stream and doesn't have endianness issues.

Notepad likes to stick in BOMs even when they're not necessary, but some programs don't like finding them. As others have pointed out, Notepad is not a very good text editor. Switching to a different text editor will almost certainly solve your problem.
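If switching editors is not an option, the BOM can also be removed from the file itself. The following is a minimal sketch, not part of the original answer: the class name BomStripper and its methods are made up for illustration, and it assumes the file really is UTF-8, so the BOM is exactly the three bytes EF BB BF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: detects and removes a leading UTF-8 BOM.
class BomStripper {
    // U+FEFF (decimal 65279) encoded as UTF-8 is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the byte array begins with the UTF-8 BOM. */
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= UTF8_BOM.length
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2];
    }

    /** Returns the input without a leading UTF-8 BOM; input without a BOM is returned unchanged. */
    static byte[] stripUtf8Bom(byte[] bytes) {
        return hasUtf8Bom(bytes)
                ? Arrays.copyOfRange(bytes, UTF8_BOM.length, bytes.length)
                : bytes;
    }

    public static void main(String[] args) throws IOException {
        // The path of the source file to clean is taken from the command line.
        Path file = Paths.get(args[0]);
        byte[] original = Files.readAllBytes(file);
        if (hasUtf8Bom(original)) {
            Files.write(file, stripUtf8Bom(original));
            System.out.println("Removed BOM from " + file);
        } else {
            System.out.println("No BOM found in " + file);
        }
    }
}
```

The sketch rewrites the file in place, so it is worth trying on a copy first.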

What charset to use when reading in a Java source file?

Character encodings vary

Any tool can write Java source code in any encoding. Even the idea of a .java file is not defined by the Java Language Spec. Any IDE can persist Java source code any way it wants, with any encoding.

The tools are responsible for ultimately providing a Unicode-compliant stream of characters into the compiler toolchain. How they collect and persist the source code is up to the particular tools.

The Java Language Specification states in Chapter 3 Lexical Structure:

Programs are written using the Unicode character set. Information about this character set and its associated character encodings may be found at http://www.unicode.org/.

So presumably a Java source code file would use one of the character encodings common with Unicode, such as UTF-8, UTF-16, or UCS-2.

Section 3.2 Lexical Translations mentions that a Java program could use an encoding such as ASCII by embedding Unicode escapes:

Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx.

While UTF-8 is common in my experience, that is not the only possible encoding. You must know or guess the encoding of any particular source file, and you must account for expanding any Unicode escapes.
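One way to support such a guess is to decode the bytes strictly and see whether the decoder complains. The sketch below is an illustration only; the class and method names are invented here, and it does not expand Unicode escapes. Note that a clean decode does not prove the guess: single-byte charsets such as ISO-8859-1 accept every byte sequence, so the check is mainly useful for strict encodings like UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: tests whether bytes are valid in a candidate charset.
class EncodingCheck {
    /** Returns true if the bytes decode in the given charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text, "cafe" with an acute accent on the e, in two encodings.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);        // 63 61 66 C3 A9
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1); // 63 61 66 E9

        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));   // true
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8)); // false: a lone E9 is not valid UTF-8
    }
}
```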

Other issues

By the way, note that javac, at least in the Oracle JDK, does not accept the byte order mark (BOM) that is optional in UTF-8 files. This is due to a bug (JDK-4508058) that will never be fixed because of backward-compatibility concerns.

Also note that line terminators may vary: the ASCII characters CR (CARRIAGE RETURN), or LF (LINE FEED), or CR LF.

White space varies: SPACE (SP), CHARACTER TABULATION (HT) (horizontal tab), FORM FEED (FF), and line terminators.

Read the spec for additional details. For example, regarding the SUBSTITUTE character:

As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.

About character encoding

Be sure you understand the basics of Unicode and of character encoding. Best place to start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.


Even supposed rules such as “one public class per .java file” may be defined by particular tools rather than by Java itself. The CodeWarrior tools for Java way-back-when supported multiple classes per file.

Saving a Java file in UTF-8

The issue is the setting of file.encoding when you run the program, and the destination of System.out. If System.out goes to the Eclipse console, that console may well be set to UTF-8. If it goes to a plain Windows command prompt (a "DOS box"), the console uses a legacy code page such as CP1252 and will only display ? in this case.
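To take file.encoding out of the picture, the program can wrap System.out in a PrintStream with an explicit encoding. This is a sketch under assumptions rather than the original answer's fix: the class name Utf8Console is invented, and the output only looks right if the receiving console actually interprets UTF-8 (on Windows that may require switching the code page, for example with chcp 65001).

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper: forces UTF-8 output regardless of the platform default.
class Utf8Console {
    /** Wraps a stream so that printed text is always encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream out) {
        try {
            // autoflush = true, so println output appears immediately
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shows what the JVM picked up as its default encoding.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));

        PrintStream out = utf8Stream(System.out);
        out.println("caf\u00e9"); // the accented e is written as the bytes C3 A9
    }
}
```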

Java source files - Is encoding still relevant once compiled?

The source code encoding is very relevant while compiling, as the OP says in his post. However after compiling, all literal text is stored as (modified-) UTF-8 encoded strings.

All string literals, class/method/field names and references to them are stored in the constant pool of the .class file in UTF-8 encoding:

From the JVM spec (for Java version 1.7):

4.4.7. The CONSTANT_Utf8_info Structure

The CONSTANT_Utf8_info structure is used to represent constant string
values:

[...]

String content is encoded in modified UTF-8. Modified UTF-8
strings are encoded so that code point sequences that contain only
non-null ASCII characters can be represented using only 1 byte per
code point, but all code points in the Unicode codespace can be
represented.

So once your source code is compiled, it is stored in a known character encoding (modified UTF-8) and you no longer need to specify the source file encoding.

In general, section 4.4 of the JVM specification explains how the constant pool works and that Strings, class/field/method names etc. are represented by a CONSTANT_Utf8_info structure.
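The difference between modified UTF-8 and standard UTF-8 can be observed from Java itself, because DataOutputStream.writeUTF produces modified UTF-8. The following sketch is only an illustration (the class name is invented); it shows the two well-known differences: U+0000 takes two bytes, and a supplementary character is stored as two three-byte surrogate encodings.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class comparing modified UTF-8 with standard UTF-8.
class ModifiedUtf8Demo {
    /** Encodes a string as modified UTF-8, dropping the 2-byte length prefix that writeUTF adds. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(buffer)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) {
        String nul = "\0";                                        // U+0000
        String supplementary = new String(Character.toChars(0x1F600)); // a code point outside the BMP

        System.out.println(modifiedUtf8(nul).length);                              // 2 (bytes C0 80)
        System.out.println(nul.getBytes(StandardCharsets.UTF_8).length);           // 1
        System.out.println(modifiedUtf8(supplementary).length);                    // 6 (two 3-byte surrogates)
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```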

How can I specify the encoding of Java source files?

You can set it when you compile the file, using the -encoding option of javac. See the javac documentation:

http://docs.oracle.com/javase/6/docs/technotes/tools/windows/javac.html

(Java) How to create a JAR file in the command line when code contains UTF-8 characters

Creating a JAR file is essentially archiving the .class files, so you should not get any error because of UTF-8 characters in the source files.

To resolve a compilation error caused by UTF-8 characters, use the -encoding argument to specify the encoding of the source files:

javac SourceJavaFile.java -encoding UTF-8
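The same option can be passed when the compiler is invoked from Java code through the standard javax.tools API. The sketch below is an assumption-laden illustration, not part of the original answer: the class name, the temporary directory, and the Greeting sample source are all invented, and it needs to run on a JDK (on a bare JRE there is no system compiler).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical demo: compiles a UTF-8 source file with an explicit -encoding option.
class CompileWithEncoding {
    /** Compiles one source file as UTF-8 and returns the compiler's exit code (0 means success). */
    static int compileUtf8(Path source, Path outputDir) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler available; run this on a JDK");
        }
        // Equivalent to: javac -encoding UTF-8 -d <outputDir> <source>
        return compiler.run(null, null, null,
                "-encoding", "UTF-8",
                "-d", outputDir.toString(),
                source.toString());
    }

    /** Writes a small UTF-8 source file containing a non-ASCII literal and compiles it. */
    static int compileSample() {
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path source = dir.resolve("Greeting.java");
            String code = "class Greeting { String text = \"caf\u00e9\"; }\n";
            Files.write(source, code.getBytes(StandardCharsets.UTF_8));
            return compileUtf8(source, dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("javac exit code: " + compileSample());
    }
}
```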

Compile java source with unmappable character using IBM JDK

Here are the steps I used to fix my problem:

Only some of the source files contain unmappable UTF-8 characters, so we can find all such Java files by raising javac's maximum error count with a compiler argument when compiling with Ant:

<compilerarg line="-Xmaxerrs 100000" />

Then dump the error messages to a file when you run the ant command:

ant -buildfile compile.xml > error.txt

You can then use Notepad++ to process the output file and get the list of files that have encoding problems, so that you can fix them:

  1. Use a regular expression to remove unneeded content; and
  2. Use TextFX to sort the lines and remove duplicates.
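The two Notepad++ steps above can also be sketched in Java. This is a hedged illustration, not the original author's method: the class name is invented, and the regular expression assumes log lines shaped like "[javac] /path/Foo.java:12: error: unmappable character ...", which may differ for the IBM JDK or for paths containing spaces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a sorted, duplicate-free list of files from a build log.
class UnmappableFileList {
    // Assumed line shape: "<path>.java:<line>: ... unmappable character ..."
    private static final Pattern ERROR_LINE =
            Pattern.compile("(\\S+\\.java):\\d+:.*unmappable character");

    /** Returns the files named in "unmappable character" errors, sorted and without duplicates. */
    static Set<String> filesWithEncodingErrors(List<String> logLines) {
        Set<String> files = new TreeSet<>(); // TreeSet sorts and removes duplicates
        for (String line : logLines) {
            Matcher m = ERROR_LINE.matcher(line);
            if (m.find()) {
                files.add(m.group(1));
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Reads the log named on the command line, e.g. error.txt.
        // ISO-8859-1 is used because it accepts any byte, so a log that
        // echoes badly encoded source lines cannot make the read fail.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        for (String file : filesWithEncodingErrors(lines)) {
            System.out.println(file);
        }
    }
}
```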

