file.encoding has no effect, LC_ALL environment variable does it


Note: I think I have finally nailed this down. I am not certain it is right, but this is what I found out through some code reading and tests, and I don't have additional time to look into it further. If anyone is interested, please check it and tell me whether this answer is right or wrong - I would be glad :)

The reference I used was from this tarball available at OpenJDK's site:
openjdk-6-src-b25-01_may_2012.tar.gz

  1. Java natively translates all strings to the platform's local encoding in this method: jdk/src/share/native/common/jni_util.c - JNU_GetStringPlatformChars(). The system property sun.jnu.encoding is used to determine the platform's encoding.

  2. The value of sun.jnu.encoding is set in jdk/src/solaris/native/java/lang/java_props_md.c - GetJavaProperties(), using libc's setlocale() function. The LC_ALL environment variable determines the value of sun.jnu.encoding; a value given on the command line with the -Dsun.jnu.encoding option is ignored.

  3. File.exists() is implemented in jdk/src/share/classes/java/io/File.java and returns

    return ((fs.getBooleanAttributes(this) & FileSystem.BA_EXISTS) != 0);

  4. getBooleanAttributes() is implemented natively (I am skipping several files in the code-browsing chain) in jdk/src/share/native/java/io/UnixFileSystem_md.c, in the function
    Java_java_io_UnixFileSystem_getBooleanAttributes0(). There the macro
    WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path) converts the path string to the platform's encoding.

  5. So conversion to the wrong encoding sends a wrong C string (char array) to the subsequent stat() call, which then reports that the file cannot be found.

LESSON: LC_ALL is very important
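The chain above can be probed from Java. A minimal sketch (the class name EncodingProbe and the file name are my own, for illustration): it prints the two relevant properties and checks a file whose name contains a non-ASCII character. With a mismatched LC_ALL, exists() can return false even though the file is on disk.

```java
import java.io.File;

public class EncodingProbe {
    public static void main(String[] args) {
        // file.encoding drives I/O stream defaults; sun.jnu.encoding
        // (an internal HotSpot property, derived from the locale at
        // startup) drives the byte conversion of file *names*.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));

        // The path below is converted to C bytes via sun.jnu.encoding
        // before stat() is called, as traced in steps 1-5 above.
        File f = new File("ümlaut.txt");
        System.out.println("exists: " + f.exists());
    }
}
```

Running this once with `LC_ALL=en_US.UTF-8` and once with `LC_ALL=C` should show sun.jnu.encoding changing while `-Dsun.jnu.encoding=...` on the command line has no effect.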

How to set JVM encoding properties to UTF-8

We can set the source encoding and output encoding by passing runtime arguments to the command as follows:

mvn -Dproject.build.sourceEncoding=UTF-8 -Dproject.reporting.outputEncoding=UTF-8 clean deploy 

Or by adding line in pom.xml:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <redis.version>1.3.5.RELEASE</redis.version>
</properties>

Chef ENV settings not working

I found that the solution that worked for me was to copy /etc/default/lang.sh to the box - either in the bootstrap shell script or as inline shell - before any recipes are run. (So it should be the first thing done in the Vagrantfile after the box definitions.)
lang file:

export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

From there the database should get set up with the UTF-8 encoding.
Hope this helps, as I spent days searching for solutions and pieced this together from various discussions; the real problem turned out to be the timing of when the values are set.

Wrong File Encoding in JVM after Linux Update

Thanks to icza. I googled a little for JAVA_OPTS and found that I should use JAVA_TOOL_OPTIONS instead.
See How do I use the JAVA_OPTS environment variable?

Or _JAVA_OPTIONS:
Running java with JAVA_OPTS env variable

Both work just fine, for runtime and for the compiler:

>export JAVA_TOOL_OPTIONS=-Dfile.encoding=ISO8859-1
>java Test
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=ISO8859-1
ISO8859-1

>javac Test.java
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=ISO8859-1

>export _JAVA_OPTIONS=-Dfile.encoding=ISO8859-1
>java Test
Picked up _JAVA_OPTIONS: -Dfile.encoding=ISO8859-1
ISO8859-1

>javac Test.java
Picked up _JAVA_OPTIONS: -Dfile.encoding=ISO8859-1
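The Test class used in the transcripts above is not shown; presumably it is something like this minimal sketch, which just prints the effective default encoding:

```java
public class Test {
    public static void main(String[] args) {
        // Prints the default charset the JVM picked up, so the effect
        // of JAVA_TOOL_OPTIONS / _JAVA_OPTIONS is directly visible.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```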

How to specify a char set for file name (not content) in Java?

The portable Java API does not have a concept of a file system character encoding, as that wouldn't be portable: Windows, for example, stores file names as Unicode regardless of the locale. On Linux, however, the LC_CTYPE facet of your locale determines the encoding of the file system. So by exporting LC_CTYPE=en_US.utf8 or similar to the environment before you launch your Java application, your application will use that for file name handling.

Also see file.encoding has no effect, LC_ALL environment variable does it which talks about some of the internals behind this conversion.
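A small sketch of what "file name handling" means in practice (the class name ListNames is my own): every on-disk name returned by list() is decoded from raw bytes using the charset the JVM derived from the locale at startup.

```java
import java.io.File;

public class ListNames {
    public static void main(String[] args) {
        // On Linux, each directory entry is a byte sequence; the JVM
        // decodes it using the locale-derived charset (sun.jnu.encoding),
        // so a wrong LC_CTYPE garbles non-ASCII names here.
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list();
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```

Launching the same program under different LC_CTYPE values is an easy way to see whether non-ASCII file names survive the round trip.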

Character encoding in R

I found the answer myself. The problem was the transformation from UTF-8 to the system locale (the default encoding in R) through fileEncoding. As I use RStudio, I just changed the default encoding to UTF-8 and removed fileEncoding="UTF-8-BOM" from read.csv. Then the entire csv file was read and RStudio displayed all characters correctly.

Why doesn't Encoding.default_external respect LANG?

I figured it out. Not only does the LANG environment variable need to be set, but the locale it specifies must have been generated for the OS. On a stock Linux image, the default locale may be something that is not UTF-8. In my particular case, I'm using Debian 7.7 and the default locale is "POSIX". I was able to set the default locale by installing the locales package and following the interactive prompts to generate the en_US.UTF-8 locale:

$ apt-get -y install locales

If the locales package is already installed, you can just reconfigure it instead:

$ dpkg-reconfigure locales

Now setting LANG will change the current system locale, and Ruby's Encoding.default_external will be set properly:

$ export LANG=en_US.UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ irb
irb(main):001:0> Encoding.default_external
=> #<Encoding:UTF-8>

For an example of how to automate the generation and configuration of the default locale instead of doing it interactively, take a look at this Docker image.
