Passing Command Line Unicode Argument to Java Code

What character encoding is used for Eclipse VM arguments?

My conclusion is that the conversion depends on the default encoding (the Windows setting "Language for non-Unicode programs").
Here is the test program:

package test;

import java.io.FileOutputStream;

public class Test {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // The bracketed literal is a reference value compiled into the class;
        // it never passes through the command line.
        sb.append("[카운터] sysprop=[").append(System.getProperty("cenv"));
        if (args.length > 0) {
            sb.append("], cmd args=[").append(args[0]);
        }
        sb.append("], file.encoding=").append(System.getProperty("file.encoding"));
        // Write the result to a file as UTF-8 instead of System.out.
        FileOutputStream fout = new FileOutputStream("/testout");
        try {
            fout.write(sb.toString().getBytes("UTF-8"));
        } finally {
            fout.close();
        }
        // Thread.sleep(10000); // uncomment to check the arguments using Process Explorer
    }
}

Test1: "Language for non-Unicode programs" is Korean(Korea)

Execute in the command prompt: java -Dcenv=카운터 test.Test 카운터 (the Korean characters are correct when I verify the arguments using Process Explorer)

Result: [카운터] sysprop=[카운터], cmd args=[카운터], file.encoding=MS949

Test2: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

Execute in the command prompt (pasted from the clipboard): java -Dcenv=카운터 test.Test 카운터 (I cannot see the Korean characters in the command window; however, they are correct when I verify the arguments using Process Explorer)

Result: [카운터] sysprop=[???], cmd args=[???], file.encoding=MS950

Test3: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

Launch from Eclipse with both Program arguments and VM arguments set. The command line in Process Explorer is C:\pg\jdk160\bin\javaw.exe -agentlib:jdwp=transport=dt_socket,suspend=y,address=localhost:50672 -Dcenv=카운터 -Dfile.encoding=UTF-8 -classpath S:\ws\wtest\bin test.Test 카운터 (this is the same as what the Properties dialog of the Eclipse Debug view shows)

Result: [카운터] sysprop=[???], cmd args=[bin], file.encoding=UTF-8

Change the Korean characters to "碁石", which exists in both the MS950 and MS949 charsets:

  • Test1 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=MS949
  • Test2 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=MS950
  • Test3 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=UTF-8

Change the Korean characters to "鈥焢", which exists in the MS950 charset:

  • Test1 Result: [鈥焢] sysprop=[??], cmd args=[??], file.encoding=MS949
  • Test2 Result: [鈥焢] sysprop=[鈥焢], cmd args=[鈥焢], file.encoding=MS950
  • Test3 Result: [鈥焢] sysprop=[鈥焢], cmd args=[鈥焢], file.encoding=UTF-8

Change the Korean characters to "宽广", which exists in the GBK charset:

  • Test1 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=MS949
  • Test2 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=MS950
  • Test3 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=UTF-8
  • Test4: to verify my assumption, I changed "Language for non-Unicode programs" to Chinese(Simplified, PRC) and executed java -Dcenv=宽广 test.Test 宽广 in the command prompt

    Result: [宽广] sysprop=[宽广], cmd args=[宽广], file.encoding=GBK

During testing, I always checked the command line via Process Explorer and made sure all characters were correct.
However, the argument characters are converted using the default encoding before main(String[] args) of the Java class is invoked. If a character does not exist in the charset of the default encoding, the program receives an unexpected argument (a "?" in place of that character in the tests above).
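
The "?" pattern in the results is what a lossy pass through a charset looks like. The following is a minimal sketch of that effect, not the launcher's actual code path: the class name CharsetRoundTrip is made up for illustration, and ISO-8859-1 stands in for the Windows code pages because every JVM ships it. The MS949/MS950 lines only run if the JVM provides those charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original test program.
class CharsetRoundTrip {
    // Encode to bytes in the given charset and decode again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "카운터" written with escapes so the source file encoding does not matter.
        String korean = "\uCE74\uC6B4\uD130";

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(korean, StandardCharsets.UTF_8).equals(korean)); // prints true

        // ISO-8859-1 has no Hangul, so each character becomes '?'.
        System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1)); // prints ???

        // Assumption: these names resolve only if the JVM bundles the extended charsets.
        for (String name : new String[] {"MS949", "MS950"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " -> " + roundTrip(korean, Charset.forName(name)));
            }
        }
    }
}
```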

I'm not sure whether the problem is caused by java.exe/javaw.exe or by Windows. Either way, passing non-ASCII parameters as command-line arguments is not a good idea.
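
One possible way around this, which I did not test in the runs above, is to keep the command line pure ASCII and decode inside the program. This sketch assumes Java 8 or later (for java.util.Base64) and that you control both the caller and the callee; the class name AsciiSafeArgs is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: transports arbitrary text as an ASCII-only argument.
class AsciiSafeArgs {
    // Caller side: UTF-8 bytes, then URL-safe Base64 without padding,
    // so the result contains only A-Z, a-z, 0-9, '-' and '_'.
    static String encode(String value) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    // Callee side: reverse the two steps.
    // Throws IllegalArgumentException if the argument is not valid Base64.
    static String decode(String arg) {
        return new String(Base64.getUrlDecoder().decode(arg), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Each argument is expected to have been produced by encode().
        for (String arg : args) {
            System.out.println(decode(arg));
        }
    }
}
```

The caller would pass encode("카운터") instead of the raw text. Because the argument is plain ASCII, the default encoding no longer matters for the transport step.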

BTW, I also tried executing the command via a .bat file (file encoding is UTF-8), in case someone is interested:

Test5: "Language for non-Unicode programs" is Korean(Korea)

The command line in Process Explorer is java -Dcenv=移댁슫?? test.Test 移댁슫?? (the Korean characters are garbled)

Result: [카운터] sysprop=[移댁슫??], cmd args=[移댁슫??], file.encoding=MS949

Test6: "Language for non-Unicode programs" is Korean(Korea)

Add another VM argument. The command line in Process Explorer is java -Dfile.encoding=UTF-8 -Dcenv=移댁슫?? test.Test 移댁슫?? (the Korean characters are garbled)

Result: [카운터] sysprop=[移댁슫??], cmd args=[移댁슫??], file.encoding=UTF-8

Test7: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

The command line in Process Explorer is java -cp s:\ws\wtest\bin -Dcenv=儦渥?? test.Test 儦渥?? (the Korean characters are garbled)

Result: [카운터] sysprop=[儦渥??], cmd args=[儦渥??], file.encoding=MS950

Java: Runtime.getRuntime().exec() passes arguments in unicode when it shouldn't

Found it. The behaviour is controlled by the system property file.encoding. NetBeans sets it to UTF-8; on the command line, it is ISO-8859-15.
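
To compare the two environments, it may help to print the encoding-related settings from inside each JVM. This is only a diagnostic sketch: the class name EncodingReport is made up, and sun.jnu.encoding is an implementation-specific property of Oracle/OpenJDK builds that may be null elsewhere.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class.
class EncodingReport {
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                // Assumption: only Oracle/OpenJDK builds set this property.
                + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once from the IDE and once from the shell should show whether the two JVMs really start with different values.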

Processing unicode characters in command line arguments

I am not sure exactly at which point this happens, but the command line parameters apparently are expected to contain only ASCII characters, and to decode the byte array to a string, bytes.decode(encoding, errors) is used:

param = b'\xc4\x81'.decode('ASCII', 'surrogateescape')
print(param == '\udcc4\udc81') # True

When the decoder encounters a non-ASCII byte, it handles it according to the selected error handler. In this case, the surrogateescape error handler replaces each undecodable byte with a lone surrogate code point in the range U+DC80 to U+DCFF.

So, the way to fix this is to encode the incorrectly decoded string back to a byte array using the same surrogateescape error handler, and then decode it as UTF-8:

import sys
param = sys.argv[1]
param_unicode = param.encode('ASCII', 'surrogateescape').decode('utf-8')
print(param_unicode)

$ python test.py ā
ā

It should be verified, though, if the command line parameters really are always decoded using the ASCII encoding. Perhaps it is different on different platforms and is configurable.
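
For comparison with the Java questions above, the same "re-encode, then decode as UTF-8" idea can be sketched in Java. Java has no surrogateescape handler, so this sketch swaps in ISO-8859-1, which maps every byte to exactly one character and back. It only works if the string really was decoded as ISO-8859-1; the class name RedecodeAsUtf8 is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the re-encode/decode repair.
class RedecodeAsUtf8 {
    // Turn the wrongly decoded string back into its original bytes,
    // then decode those bytes with the charset they were written in.
    static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC4, (byte) 0x81}; // UTF-8 encoding of U+0101
        // Simulate the wrong decoding step: two Latin-1 characters instead of one.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                    // prints 2
        System.out.println(redecode(wrong).equals("\u0101")); // prints true
    }
}
```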

How to send UTF-8 command line data from PHP to Java for correct encoding

The problem has been solved using the solution provided here:

Unicode to PHP exec

Everyone's help got me on the right track. It was indeed a locale issue, but not at the OS level. Instead it was with PHP's locale.

Another user had a similar issue, and it was fixed by adding the following code to the PHP script before executing the command line that calls the Java program:

$locale = 'en_US.utf-8';
setlocale(LC_ALL, $locale);
putenv('LC_ALL='.$locale);

So now, in the Java code, the args[0] parameter is displayed correctly, and the processed text is stored in a file, sent back, and received by the PHP script properly. It took a bit of looking up byte values and the corresponding UTF-8 encodings before I could see that PHP was translating a string that was correct just before exec() into a different string during the exec() call. During this call, the UTF-8 bytes 0xC3 0xA9 for "é" (Unicode \u00E9) were turned into 0x3F 0x3F (two ASCII question mark characters).
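
The byte values are easy to reproduce on the Java side. The sketch below is only a model of the symptom (two UTF-8 bytes each ending up as 0x3F), not a description of what PHP does internally; the class name LossyAsciiDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of how one "é" can become two question marks.
class LossyAsciiDemo {
    // Decode the bytes as US-ASCII (each byte >= 0x80 becomes U+FFFD),
    // then encode back to US-ASCII (U+FFFD is unmappable and becomes '?').
    static byte[] throughAscii(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.US_ASCII);
        return decoded.getBytes(StandardCharsets.US_ASCII);
    }

    // Format bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] eAcute = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(eAcute) + " -> " + hex(throughAscii(eAcute)));
        // prints: c3 a9 -> 3f 3f
    }
}
```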

During my searching here on Stack Overflow I saw a warning not to use literals (e.g. "Présentation"), and once I traced the data back to the caller it became evident that the issue involved the actual call to exec().

Hopefully someone else new to Unicode processing can benefit from this information.

Thanks for everyone's input which pointed me in the right direction.


