How to Find the Default Charset/Encoding in Java

How to Find the Default Charset/Encoding in Java?

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset isn't being set. I don't know if this is a bug or not but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
synchronized (Charset.class) {
if (defaultCharset == null) {
java.security.PrivilegedAction pa =
new GetPropertyAction("file.encoding");
String csn = (String) AccessController.doPrivileged(pa);
Charset cs = lookup(csn);
if (cs != null)
return cs;
return forName("UTF-8");
}
return defaultCharset;
}
}

JVM 1.6:

public static Charset defaultCharset() {
if (defaultCharset == null) {
synchronized (Charset.class) {
java.security.PrivilegedAction pa =
new GetPropertyAction("file.encoding");
String csn = (String) AccessController.doPrivileged(pa);
Charset cs = lookup(csn);
if (cs != null)
defaultCharset = cs;
else
defaultCharset = forName("UTF-8");
}
}
return defaultCharset;
}

When you set the file encoding to file.encoding=Latin-1 the next time you call Charset.defaultCharset(), what happens is, because the cached default charset isn't set, it will try to find the appropriate charset for the name Latin-1. This name isn't found, because it's incorrect, and returns the default UTF-8.

As for why the IO classes such as OutputStreamWriter return an unexpected result,

the implementation of sun.nio.cs.StreamEncoder (witch is used by these IO classes) is different as well for JVM 1.5 and JVM 1.6. The JVM 1.6 implementation is based in the Charset.defaultCharset() method to get the default encoding, if one is not provided to IO classes. The JVM 1.5 implementation uses a different method Converters.getDefaultEncodingName(); to get the default charset. This method uses its own cache of the default charset that is set upon JVM initialization:

JVM 1.6:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
Object lock,
String charsetName)
throws UnsupportedEncodingException
{
String csn = charsetName;
if (csn == null)
csn = Charset.defaultCharset().name();
try {
if (Charset.isSupported(csn))
return new StreamEncoder(out, lock, Charset.forName(csn));
} catch (IllegalCharsetNameException x) { }
throw new UnsupportedEncodingException (csn);
}

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
Object lock,
String charsetName)
throws UnsupportedEncodingException
{
String csn = charsetName;
if (csn == null)
csn = Converters.getDefaultEncodingName();
if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
try {
if (Charset.isSupported(csn))
return new CharsetSE(out, lock, Charset.forName(csn));
} catch (IllegalCharsetNameException x) { }
}
return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

Determining default character set of platform in Java

It means the default character encoding of the JVM that you're running on,

To check the default encoding you can do the following:

System.getProperty("file.encoding");

that will return the default encoding (and the one used by getBytes() above).

Default charset for file encoding - Java

When I run your program on my Mac with -Dfile.encoding=UTF-16, I get the following output (as a hex dump):

0000000    fe  ff  00  54  00  68  00  65  00  20  00  64  00  65  00  66
0000020 00 61 00 75 00 6c 00 74 00 20 00 63 00 68 00 61
0000040 00 72 00 73 00 65 00 74 00 20 00 69 00 73 00 3a
0000060 00 20 00 55 00 54 00 46 00 2d 00 31 00 36 00 0a

So what is probably happening with you is: setting file.encoding to UTF-16 causes Java to write UTF-16 sequences to the console and your console is not set up to handle UTF-16 output. The first two bytes (which together form the Unicode BYTE ORDER MARK) don't display properly (probably due to your console font and/or driver) and the remaining output is truncated at the first null byte (again, due to your console software).

You can try directing the output of your program to a file and looking at it with a hex editor or something too get a better idea of what's happening.

Platform's default charset on different platforms?

That's a user specific setting. On many modern Linux systems, it's UTF-8. On Macs, it’s MacRoman. In the US on Windows, it's often CP1250, in Europe it's CP1252. In China, you often find simplified chinese (Big5 or a GB*).

But that’s the system default, which each user can change at any time. Which is probably the solution: Set the encoding when you start your app using the system property file.encoding

See this answer how to do that. I suggest to put this into a small script which starts your app, so the user default isn't tainted.

Setting the default Java character encoding

Unfortunately, the file.encoding property has to be specified as the JVM starts up; by the time your main method is entered, the character encoding used by String.getBytes() and the default constructors of InputStreamReader and OutputStreamWriter has been permanently cached.

As Edward Grech points out, in a special case like this, the environment variable JAVA_TOOL_OPTIONS can be used to specify this property, but it's normally done like this:

java -Dfile.encoding=UTF-8 … com.x.Main

Charset.defaultCharset() will reflect changes to the file.encoding property, but most of the code in the core Java libraries that need to determine the default character encoding do not use this mechanism.

When you are encoding or decoding, you can query the file.encoding property or Charset.defaultCharset() to find the current default encoding, and use the appropriate method or constructor overload to specify it.

Java : How to determine the correct charset encoding of a stream

I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet

How to check the charset of string in Java?

Strings in java, AFAIK, do not retain their original encoding - they are always stored internally in some Unicode form.
You want to detect the charset of the original stream/bytes - this is why I think your String.toBytes() call is too late.

Ideally if you could get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well



Related Topics



Leave a reply



Submit