How to Configure Encoding in Maven

How to configure encoding in Maven?

OK, I have found the problem.

I use some reporting plugins. In the documentation of the failsafe-maven-plugin I found that the <encoding> configuration, of course, uses ${project.reporting.outputEncoding} by default.

So I added the property as a child element of the project element and everything is fine now:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>

See also http://maven.apache.org/general.html#encoding-warning

Maven UTF-8 encoding issue

You're missing the encoding from your new String() constructor, so it's using the default encoding of your platform which isn't UTF-8 (looks like some variant of ISO-8859-1).

If you use the following code (which doesn't make much sense, but shows the default encoding botching things), you'll see that it's printed properly everywhere.

// Encode and decode with an explicit charset; the StandardCharsets overloads
// also avoid the checked UnsupportedEncodingException of the String-based ones
String myString = "Türkçe Karakter Testi : ğüşiöçĞÜİŞÇÖĞ";
String value = new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
System.out.println(value);

What's the lesson here? Always specify the encoding to use when dealing with byte/character conversion! This includes such methods as String.getBytes(), new String() and new InputStreamReader().
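As a sketch of that advice, here is a small self-contained example (the class and variable names are mine, not from the original answer) that passes an explicit charset to all three of those APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) throws IOException {
        String text = "Türkçe";

        // String -> bytes with an explicit charset
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // bytes -> String with an explicit charset
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);

        // Reading a byte stream with an explicit charset
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }

        System.out.println(decoded.equals(text));       // true
        System.out.println(sb.toString().equals(text)); // true
    }
}
```

The StandardCharsets constants are generally preferable to charset-name strings: they cannot be misspelled, and those overloads throw no checked exception.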

This is just one of the many ways that character encoding can bite you in the behind. It may seem like a simple problem, but it catches unsuspecting developers all the time.

Maven change encoding to certain files

[10,19] means: The 19th character on the 10th line.

@VGR explained precisely why reading UTF-8-encoded source files as Cp1252 causes compilation to fail: any non-ASCII character is encoded as at least 2 bytes in UTF-8. If you then incorrectly read those bytes as Cp1252, you get 2 or more gobbledygook characters. Given that char literals allow only 1 character inside, the code now has compiler errors in it.
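To see the mechanics concretely, this minimal sketch (assuming the windows-1252 charset is available in your JRE, as it is on typical desktop JDKs) decodes UTF-8 bytes as Cp1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // 'é' takes two bytes in UTF-8: 0xC3 0xA9
        byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2

        // Misreading those two bytes as Cp1252 yields two characters
        String garbled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(garbled); // Ã©
    }
}
```

One character in, two characters out, which is exactly why a char literal like 'é' stops compiling when read with the wrong encoding.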

There's no way to tell Maven that some files are UTF-8 and some files are Cp1252 unless you run separate compilation passes. That is hard to do, would be very confusing and hard to maintain (so, a bad idea), and can't work at all unless you either involve stubs or you're 'lucky' and one of the two batches is 'self-contained' (contains absolutely no reference to anything from the other batch).

So let's rule that out as a feasible option. That leaves 2 options:

The right option - all UTF-8, all the time

Treat all source files as UTF-8. This is easier than it sounds; all ASCII characters are encoded identically in UTF-8 and Cp1252, so only non-ASCII characters need to be reviewed. These are easy to find: effectively, every byte above 127. You can use many tools to find them; there is an SO question with answers about how to do this on Linux, for example.
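If you'd rather stay inside Java than reach for Linux command-line tools, a minimal sketch like this (the class and method names are mine, for illustration) reports the offsets of all non-ASCII bytes in a file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NonAsciiFinder {
    // Returns the offsets of all bytes outside the ASCII range (> 0x7F)
    static List<Integer> nonAsciiOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) > 0x7F) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (int offset : nonAsciiOffsets(data)) {
            System.out.printf("byte 0x%02X at offset %d%n", data[offset] & 0xFF, offset);
        }
    }
}
```

A file that produces no output from this scan is pure ASCII and therefore already valid UTF-8 and Cp1252 at the same time.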

Open those files with any editor that makes clear which encoding it is using (most developer editors do this right), reload with different encodings until the characters look correct, then re-save as UTF-8, voila. All the files with no special characters are valid UTF-8 and Cp1252 at the same time - you can simply compile them using UTF-8 encoding and it'll work fine.

Now all your code is in UTF-8. Configure your IDE project accordingly / just leave your Maven POM on 'it is UTF-8' and all Maven-aware project tools will pick up on this automatically.

Considerably worse option - backslash-u escaping

If you can't do that because some tools read those source files and insist on parsing them as Cp1252 (not Maven or javac - in fact, pretty much nothing major from the Java ecosystem, which is all quite UTF-8 aware), there is still a way to remove all non-ASCII from source files: backslash-u escapes.

A \u0123 escape is legal anywhere in any Java file, not just in string literals. It means: the Unicode character with that value (in hex). For example, this:

class Test {
    public static void main(String[] args) {
        //This does nothing, right? \u000aSystem.out.println("Hello!");
    }
}

When you run it, it actually prints Hello!. Even though the sysout is in a comment... or is it?

\u000a is the newline character. So the above file is parsed as a comment on one line, then a newline, so that System.out statement really is in there and isn't in a comment. Many tools don't know this (e.g. Sublime Text and co. will render that sysout statement in comment-green), but javac and, in fact, the Java Language Specification are crystal clear on this: the above code has a real print statement in there, not commented out.

Thus, you can go hunt for all non-ASCII and replace it with \u escapes, and now your code is hybridized: it parses identically regardless of which encoding you use, as long as it's an ASCII-compatible encoding - and almost all encodings are. (Only a few Japanese and other East Asian charsets, as well as UTF-16/UCS-2/UCS-4/UTF-32 style encodings, are not ASCII-compatible. Cp1252, ISO-8859, UTF-8 itself, ASCII itself, Cp850, and many many others are 'ASCII compatible', meaning 100% ASCII text is identically encoded by all of them.)

To turn things into \u escapes, look up the hexadecimal value of the symbol on any Unicode website and apply it. For example, é becomes \u00E9 and ☃ (the Unicode snowman) becomes \u2603.
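A quick way to convince yourself that the compiler treats the escape and the literal character identically:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // javac rewrites \uXXXX escapes before parsing, so these literals
        // are identical to the ones written with the raw characters
        System.out.println("\u00E9".equals("é")); // true
        System.out.println("\u2603".equals("☃")); // true
    }
}
```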

Put those escapes in wherever you see non-ASCII in a source file, even outside string literals:

Legal Java:

public class Fighter {
    public void mêléeAttack() {}
}

But if you mix up the encoding setting in your editor and the encoding setting in Maven, that goes badly. However, this:

public class Fighter {
    public void m\u00EAl\u00E9eAttack() {}
}

means the same thing and works correctly even if you mess up encodings. It just looks really bad in your editor, which is why this is a considerably worse option.

MVN compile not using UTF-8 encoding

Having had the same problem, I came across this question first. But as there was no answer, I looked further and found another question that was answered and helped me solve this:

https://stackoverflow.com/a/10375505/332248 (credit to @chrisapotek and @Jopp Eggen for answering this)

Maven: Source Encoding in UTF-8 not working?

I have found a "solution" myself:

I had to pass the encoding into the maven-surefire-plugin, but the usual

<encoding>${project.build.sourceEncoding}</encoding>

did not work. I still have no idea why, but when I pass the command-line argument into the plugin, the tests work as they should:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>2.15</version>
    <configuration>
        <argLine>-Dfile.encoding=UTF-8</argLine>
    </configuration>
</plugin>

Thanks for all your responses and additional comments!


